multisiteDGP: Data-generating processes for multisite trial simulations

Generate a complete multisite-trial summary dataset (latent site effects, sampling variances, optional dependence between effects and precisions, and observed site estimates) in one call. Use this when you are designing a multisite-trial power analysis, choosing between meta-analytic estimators, or sweeping a scenario grid before committing to a long simulation run. The eight built-in effect distributions and the literature-calibrated presets let you anchor a simulation to the applied evidence base, and bundled diagnostics, plots, and adapters carry the result through to downstream analysis.

Details

The generative model is a transparent four-layer pipeline. Each layer has a single responsibility and forwards a documented schema downstream:

Layer 1 — latent effects (gen_effects): Draws standardized site effects $z_j$ from one of eight built-in $G$ distributions — Gaussian, Student-t, skew-normal, asymmetric Laplace, two-component mixture, point-mass slab, a user callback, and a Dirichlet-process-mixture bridge — and rescales to the response-scale latent effect $\tau_j = \tau + \sigma_\tau\,z_j$.
Layer 2 — site-level precision (gen_site_sizes, gen_se_direct): Builds each site's sampling variance $\widehat{se}_j^2$ either from generated site sizes $n_j$ (the site-size-driven path) or from direct precision-scale targets — mean $\widehat{se}^2$ and the heterogeneity ratio $R$ — for the direct-precision path.
Layer 3 — precision dependence (align_rank_corr, align_copula_corr, align_hybrid_corr): Optionally aligns $\widehat{se}_j^2$ against $\tau_j$ to a target Spearman correlation through one of three injection methods (rank hill-climb, Gaussian copula, or hybrid), preserving both marginals exactly.
Layer 4 — observation draws (gen_observations): Draws the observed estimate $\widehat{\tau}_j \mid \tau_j, \widehat{se}_j^2 \sim \mathcal{N}(\tau_j, \widehat{se}_j^2)$.
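
The four layers above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the package's implementation: the parameter values, the Poisson site sizes, the variance rule se2_j = 4 / n_j, and the latent-variable reordering used for Layer 3 are all hypothetical choices made for this example, and none of the package's own functions are called.

```r
# Base-R sketch of the four-layer pipeline (illustrative only).
set.seed(1)
J         <- 50     # number of sites (hypothetical value)
tau       <- 0.20   # grand mean effect (hypothetical value)
sigma_tau <- 0.15   # between-site SD (hypothetical value)

# Layer 1: standardized effects z_j, rescaled to tau_j = tau + sigma_tau * z_j.
z     <- rnorm(J)
tau_j <- tau + sigma_tau * z

# Layer 2: sampling variances from site sizes.
# Assumption: se2_j = 4 / n_j, a common approximation for a standardized
# mean difference with equal allocation; the package may use another rule.
n_j <- pmax(10L, rpois(J, lambda = 80))
se2 <- 4 / n_j

# Layer 3 (optional): reorder se2 so its ranks follow a latent variable
# correlated with z. Only the order changes, so the set of se2 values is kept.
rho    <- 0.3
latent <- rho * z + sqrt(1 - rho^2) * rnorm(J)
ord    <- order(se2)[rank(latent, ties.method = "first")]
se2_al <- se2[ord]

# Layer 4: observed estimates, tau_hat_j ~ N(tau_j, se2_j).
tau_hat <- rnorm(J, mean = tau_j, sd = sqrt(se2_al))

sim <- data.frame(
  site      = seq_len(J),
  z_j       = z,
  tau_j     = tau_j,
  n_j       = n_j[ord],
  se2_j     = se2_al,
  tau_hat_j = tau_hat
)

# Realized Spearman correlation between latent effects and sampling variances.
realized <- cor(sim$tau_j, sim$se2_j, method = "spearman")
```

Because the Layer 3 step in this sketch only permutes the existing se2 values, the multiset of sampling variances is unchanged. The realized Spearman correlation in realized approximates the latent target rather than matching it exactly.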

Two front doors compose the four layers in one call. sim_multisite drives the precision margin from site sizes — the site-size-driven path (Paradigm A in the blueprint) — and matches how applied multisite-trial designs are usually specified. sim_meta takes the precision targets directly — the direct-precision path (Paradigm B) — and matches how meta-analysis simulations are usually specified. Both return a multisitedgp_data tibble built from an immutable multisitedgp_design, with diagnostics and provenance attached as attributes; the canonical hash is stored inside attr(x, "provenance")$canonical_hash.

On top of the pipeline, the package ships diagnostics covering Dr. Chen's four questions (precision and feasibility, realized dependence, distributional fit, downstream shrinkage), presets calibrated to published education and labor-economics evidence, plots for caterpillar / funnel / dependence views, adapters that emit data ready for metafor, baggr, and multisitepower, and reproducibility helpers (canonical_hash, provenance_string) for cross-machine replication.

Vignettes

Applied Track

Methodological Track

Function families

Front doors: sim_multisite, sim_meta, design_grid.
Design and data objects: multisitedgp_design constructor with validate_multisitedgp_design, update_multisitedgp_design, is_multisitedgp_design, is_multisitedgp_data, and as_tibble.multisitedgp_data.
Effect distributions (Layer 1): gen_effects dispatcher and the eight shape-specific generators gen_effects_gaussian, gen_effects_studentt, gen_effects_skewn, gen_effects_ald, gen_effects_mixture, gen_effects_pmslab, gen_effects_user, gen_effects_dpm.
Margin / SE models (Layer 2): gen_site_sizes for the site-size margin; gen_se_direct for the direct standard-error margin.
Precision dependence (Layer 3): align_rank_corr, align_copula_corr, align_hybrid_corr.
Observation draws (Layer 4): gen_observations.
Diagnostics: Group A precision and feasibility scalars (compute_kappa, compute_I, informativeness, compute_shrinkage, mean_shrinkage, feasibility_index, default_thresholds, heterogeneity_ratio); Group B realized dependence (realized_rank_corr, realized_rank_corr_marginal); Group C distributional fidelity (bhattacharyya_coef, ks_distance); top-of-funnel scenario sweep (scenario_audit).
Presets: preset_education_small, preset_education_modest, preset_education_substantial, preset_jebs_paper, preset_jebs_strict, preset_meta_modest, preset_small_area_estimation, preset_twin_towers, preset_walters_2024.
Output adapters: as_metafor, as_baggr, as_multisitepower.
Visualization: plot_effects, plot_funnel, plot_dependence.
Reproducibility: canonical_hash, provenance_string.

Error catalog

Errors raised by the package carry a typed S3 class hierarchy so that calling code can branch on the error category. Every error inherits from multisitedgp_error; six concrete subclasses name the failure category.

multisitedgp_error: Base class. Test with inherits(e, "multisitedgp_error") to catch any package-typed error.
multisitedgp_arg_error: Argument validation. Raised when a user-facing argument is missing, the wrong type, out of range, or exclusive with another argument.
multisitedgp_coherence_error: Design-level coherence violations across layers. Raised when a combination of arguments is individually valid but jointly inconsistent (for example, the residual / marginal target pair specified incompatibly).
multisitedgp_engine_dependence_error: Engine compatibility. Raised when the legacy A1 engine is paired with a non-zero Layer 3 dependence target, the documented Decision-C constraint.
multisitedgp_solver_error: Solver failure. Raised when an internal numerical solver (for example, the direct-precision back-calculation) fails to converge.
multisitedgp_dependence_solver_error: Layer 3 alignment-solver failure. Raised when align_rank_corr, align_copula_corr, or align_hybrid_corr cannot reach the requested target within tolerance.
multisitedgp_marginal_violation_error: Multiset preservation violation. Raised when a Layer 3 alignment pass would change the empirical marginal distribution of $z_j$ or $\widehat{se}_j^2$ (the package's marginal-preservation invariant).

Every condition object also carries a one-sentence message, an rlang::format_error_bullets() body explaining what happened, and a fix field with a concrete next step.
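
Branching on the class hierarchy might look like the sketch below. It uses only base R: signal_arg_error is a hypothetical stand-in that builds a condition with the documented class chain, and reading the next step as e$fix is an assumption about how the fix field is stored. In real use the first argument to tryCatch would be a call into the package.

```r
# Hypothetical stand-in: signals a condition carrying the documented classes.
# The package constructs its own conditions; this only mimics their shape.
signal_arg_error <- function(message, fix) {
  cond <- structure(
    list(message = message, call = NULL, fix = fix),
    class = c("multisitedgp_arg_error", "multisitedgp_error",
              "error", "condition")
  )
  stop(cond)
}

# Handlers are tried in the order listed, so put the most specific class first.
result <- tryCatch(
  signal_arg_error("Argument is out of range.",
                   fix = "Pass a positive whole number."),
  multisitedgp_arg_error = function(e) e$fix,               # argument problems
  multisitedgp_error     = function(e) conditionMessage(e)  # any other typed error
)

# A generic error handler can still test for the package base class.
caught_as_base <- tryCatch(
  signal_arg_error("Argument is out of range.",
                   fix = "Pass a positive whole number."),
  error = function(e) inherits(e, "multisitedgp_error")
)
```

Listing the subclass handler before the base-class handler keeps category-specific recovery separate from the catch-all.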

Funding

This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D240078 to the University of Alabama. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

References

Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286

Walters, C. (2024). Empirical Bayes methods in labor economics. In Handbook of Labor Economics (Vol. 5, pp. 183–260). Elsevier. doi:10.1016/bs.heslab.2024.11.001

Weiss, M. J., Bloom, H. S., Verbitsky-Savitz, N., Gupta, H., Vigil, A. E., & Cullinan, D. N. (2017). How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. Journal of Research on Educational Effectiveness, 10(4), 843–876. doi:10.1080/19345747.2017.1300719

Author

Maintainer: JoonHo Lee <jlee296@ua.edu>