multisiteDGP: Data-generating processes for multisite trial simulations
Source: R/multisitedgp-package.R
Generate a complete multisite-trial summary dataset (latent site effects, sampling variances, optional dependence between effects and precisions, and observed site estimates) in one call. Use this when you are designing a multisite-trial power analysis, choosing between meta-analytic estimators, or sweeping a scenario grid before committing to a long simulation run. The eight built-in effect distributions and the literature-calibrated presets let you anchor a simulation to the applied evidence base, and bundled diagnostics, plots, and adapters carry the result through to downstream analysis.
Details
The generative model is a transparent four-layer pipeline. Each layer has a single responsibility and forwards a documented schema downstream:
- Layer 1: latent effects (gen_effects). Draws standardized site effects \(z_j\) from one of eight built-in \(G\) distributions (Gaussian, Student-t, skew-normal, asymmetric Laplace, two-component mixture, point-mass slab, a user callback, and a Dirichlet-process-mixture bridge) and rescales them to the response-scale latent effect \(\tau_j = \tau + \sigma_\tau\,z_j\).
- Layer 2: site-level precision (gen_site_sizes, gen_se_direct). Builds each site's sampling variance \(\widehat{se}_j^2\) either from generated site sizes \(n_j\) (the site-size-driven path) or from direct precision-scale targets, namely the mean \(\widehat{se}^2\) and the heterogeneity ratio \(R\) (the direct-precision path).
- Layer 3: precision dependence (align_rank_corr, align_copula_corr, align_hybrid_corr). Optionally aligns \(\widehat{se}_j^2\) against \(\tau_j\) to a target Spearman correlation through one of three injection methods (rank hill-climb, Gaussian copula, or hybrid), preserving both marginals exactly.
- Layer 4: observation draws (gen_observations). Draws the observed estimate \(\widehat{\tau}_j \mid \tau_j, \widehat{se}_j^2 \sim \mathcal{N}(\tau_j, \widehat{se}_j^2)\).
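The four layers can be mimicked in a few lines of base R. The sketch below is illustrative only and does not call or reproduce the package's functions. The parameter values, the variance formula \(\widehat{se}_j^2 = 4 / n_j\), and the copula-style reordering used for Layer 3 are all assumptions chosen to keep the example short.

```r
# Base-R sketch of the four-layer model. Illustrative only: this is NOT the
# package's implementation, and every numeric setting below is an assumption.
set.seed(2024)

J         <- 50     # number of sites (assumed)
tau       <- 0.20   # grand mean effect (assumed)
sigma_tau <- 0.15   # between-site SD of effects (assumed)
rho       <- 0.40   # latent correlation driving the reordering (assumed)

# Layer 1: standardized effects z_j, rescaled to tau_j = tau + sigma_tau * z_j.
z     <- rnorm(J)
tau_j <- tau + sigma_tau * z

# Layer 2: sampling variances from site sizes.
# ASSUMPTION: se^2 = 4 / n_j, the usual approximation for a standardized mean
# difference with equal allocation; the package may use a different mapping.
n_j <- sample(20:200, J, replace = TRUE)
se2 <- 4 / n_j

# Layer 3: a simple copula-style reordering (a stand-in, not align_copula_corr).
# A latent score w correlated with z decides which site receives which
# variance, so the multiset of se2 values is left unchanged.
w           <- rho * z + sqrt(1 - rho^2) * rnorm(J)
se2_aligned <- sort(se2)[rank(w, ties.method = "first")]

# Layer 4: observed estimates, tau_hat_j ~ N(tau_j, se2_j).
tau_hat <- rnorm(J, mean = tau_j, sd = sqrt(se2_aligned))

sim <- data.frame(site = seq_len(J), tau_j = tau_j,
                  se2 = se2_aligned, tau_hat = tau_hat)

# Realized Spearman correlation between latent effects and sampling variances.
realized_rho <- cor(sim$tau_j, sim$se2, method = "spearman")
```

Because Layer 3 here only permutes the existing variances, `sort(sim$se2)` equals `sort(se2)` exactly, which is the marginal-preservation property described above. The realized Spearman correlation will generally differ from `rho`, which is a latent Pearson parameter rather than a rank-correlation target.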
Two front doors compose the four layers in one call.
sim_multisite drives the precision margin from site
sizes — the site-size-driven path (Paradigm A in the blueprint) — and
matches how applied multisite-trial designs are usually specified.
sim_meta takes the precision targets directly — the
direct-precision path (Paradigm B) — and matches how meta-analysis
simulations are usually specified. Both return a
multisitedgp_data tibble built from an immutable
multisitedgp_design, with diagnostics and
provenance attached as attributes; the canonical hash is stored
inside attr(x, "provenance")$canonical_hash.
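As a minimal sketch of the attribute layout described above, the following builds a hypothetical stand-in object in base R and reads the hash back. The column names, the placeholder hash value, and the use of a plain data frame instead of a tibble are assumptions for illustration; only the `attr(x, "provenance")$canonical_hash` access path comes from the documentation.

```r
# Hypothetical mock of a returned object, for illustration only. A real
# multisitedgp_data object is a tibble produced by sim_multisite or sim_meta.
mock <- data.frame(
  site    = 1:3,
  tau_hat = c(0.12, 0.31, -0.05),   # made-up estimates
  se2     = c(0.04, 0.02, 0.08)     # made-up sampling variances
)
class(mock) <- c("multisitedgp_data", class(mock))

# Provenance travels as an attribute; the hash string here is a placeholder.
attr(mock, "provenance") <- list(canonical_hash = "placeholder-hash")

# Documented access path for the canonical hash.
hash <- attr(mock, "provenance")$canonical_hash
```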
On top of the pipeline, the package ships diagnostics covering Dr.
Chen's four questions (precision and feasibility, realized
dependence, distributional fit, downstream shrinkage), presets
calibrated to published education and labor-economics evidence, plots
for caterpillar / funnel / dependence views, adapters that emit data
ready for metafor, baggr, and multisitepower, and
reproducibility helpers (canonical_hash,
provenance_string) for cross-machine replication.
Vignettes
- Applied Track
- Methodological Track
Function families
- Front doors
sim_multisite and sim_meta.
- Design and data objects
multisitedgp_design constructor with validate_multisitedgp_design, update_multisitedgp_design, is_multisitedgp_design, is_multisitedgp_data, and as_tibble.multisitedgp_data.
- Effect distributions (Layer 1)
gen_effects dispatcher and the eight shape-specific generators gen_effects_gaussian, gen_effects_studentt, gen_effects_skewn, gen_effects_ald, gen_effects_mixture, gen_effects_pmslab, gen_effects_user, gen_effects_dpm.
- Margin / SE models (Layer 2)
gen_site_sizes for the site-size margin; gen_se_direct for the direct standard-error margin.
- Precision dependence (Layer 3)
align_rank_corr, align_copula_corr, align_hybrid_corr.
- Observation draws (Layer 4)
gen_observations.
- Diagnostics
Group A precision and feasibility scalars (compute_kappa, compute_I, informativeness, compute_shrinkage, mean_shrinkage, feasibility_index, default_thresholds, heterogeneity_ratio); Group B realized dependence (realized_rank_corr, realized_rank_corr_marginal); Group C distributional fidelity (bhattacharyya_coef, ks_distance); top-of-funnel scenario sweep (scenario_audit).
- Presets
preset_education_small, preset_education_modest, preset_education_substantial, preset_jebs_paper, preset_jebs_strict, preset_meta_modest, preset_small_area_estimation, preset_twin_towers, preset_walters_2024.
- Output adapters
- Visualization
- Reproducibility
canonical_hash, provenance_string.
Error catalog
Errors raised by the package carry a typed S3 class hierarchy so
that calling code can branch on the error category. Every error
inherits from multisitedgp_error; six concrete subclasses
name the failure category.
- multisitedgp_error: Base class. Test with inherits(e, "multisitedgp_error") to catch any package-typed error.
- multisitedgp_arg_error: Argument validation. Raised when a user-facing argument is missing, the wrong type, out of range, or exclusive with another argument.
- multisitedgp_coherence_error: Design-level coherence violations across layers. Raised when a combination of arguments is individually valid but jointly inconsistent (for example, the residual / marginal target pair specified incompatibly).
- multisitedgp_engine_dependence_error: Engine compatibility. Raised when the legacy A1 engine is paired with a non-zero Layer 3 dependence target, the documented Decision-C constraint.
- multisitedgp_solver_error: Solver failure. Raised when an internal numerical solver (for example, the direct-precision back-calculation) fails to converge.
- multisitedgp_dependence_solver_error: Layer 3 alignment-solver failure. Raised when align_rank_corr, align_copula_corr, or align_hybrid_corr cannot reach the requested target within tolerance.
- multisitedgp_marginal_violation_error: Multiset preservation violation. Raised when a Layer 3 alignment pass would change the empirical marginal distribution of \(z_j\) or \(\widehat{se}_j^2\) (the package's marginal-preservation invariant).
Every condition object also carries a one-sentence message,
an rlang::format_error_bullets() body explaining what
happened, and a fix field with a concrete next step.
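Branching on the class hierarchy might look like the following base-R sketch. The helper `raise_arg_error()` is hypothetical: it fabricates a condition with the documented class chain so the example runs without the package. The message text and the assumption that the fix is stored as a list element reachable with `e$fix` are illustrative, not confirmed package behavior.

```r
# Hypothetical stand-in that signals a condition with the documented classes.
# The message and the `fix` element are made up for illustration.
raise_arg_error <- function() {
  cond <- structure(
    class = c("multisitedgp_arg_error", "multisitedgp_error",
              "error", "condition"),
    list(
      message = "`J` must be a positive integer.",
      call    = NULL,
      fix     = "Pass a whole number greater than zero."  # assumed storage
    )
  )
  stop(cond)
}

# Handlers are tried in the order given, so list the most specific class first.
outcome <- tryCatch(
  raise_arg_error(),
  multisitedgp_arg_error = function(e) paste("argument problem:", e$fix),
  multisitedgp_error     = function(e) "other package error",
  error                  = function(e) "unrelated error"
)
```

Placing the `multisitedgp_error` handler first would catch every package-typed error, including argument errors, so the specific-before-general ordering matters whenever different categories need different recovery steps.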
Funding
This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D240078 to the University of Alabama. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
References
Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286
Walters, C. (2024). Empirical Bayes methods in labor economics. In Handbook of Labor Economics (Vol. 5, pp. 183–260). Elsevier. doi:10.1016/bs.heslab.2024.11.001
Weiss, M. J., Bloom, H. S., Verbitsky-Savitz, N., Gupta, H., Vigil, A. E., & Cullinan, D. N. (2017). How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. Journal of Research on Educational Effectiveness, 10(4), 843–876. doi:10.1080/19345747.2017.1300719
Author
Maintainer: JoonHo Lee jlee296@ua.edu (ORCID)