Simulate a multisite trial data-generating process
Source: R/wrapper-sim_multisite.R
sim_multisite.Rd
sim_multisite() is the unified interface for the four-layer
multisite-trial data-generating process. Given a multisite design — site
count, per-site sizes, latent-effect distribution, and optional precision
dependence — it composes the latent-effects, site-size-margin,
dependence-alignment, and observation layers in one call and returns a
multisitedgp_data tibble with diagnostics, provenance, and a canonical
hash. This is the site-size-driven path (Paradigm A in the blueprint),
in which sampling variances are induced from a site-size margin
\(n_j\); the sister sim_meta covers the
direct-precision path (Paradigm B), in which \(\widehat{se}_j^2\) is
specified directly.
Arguments
- design
Optional multisitedgp_design. If NULL, ... is forwarded to multisitedgp_design with paradigm = "site_size". Construct a design once with multisitedgp_design or one of the presets when reusing it across multiple calls or a design_grid sweep.
- ...
Flat design arguments used only when design = NULL. See multisitedgp_design for the full parameter list. Note that paradigm cannot be passed here: the wrapper locks paradigm = "site_size"; use sim_meta for direct-precision designs.
- seed
Optional integer seed override. When supplied, replaces design$seed and gives bit-identical reruns. Use a small integer (e.g. 1L) for examples; use a 9-digit integer in production for cross-run uniqueness.
Value
A multisitedgp_data tibble with one row per site and columns:
- site_index
Integer site identifier \(j = 1, \ldots, J\), preserved through the pipeline (Layer 3 permutes the (se_j, se2_j, n_j) triple, never site_index).
- z_j
Standardized residual effect (mean 0, variance 1).
- tau_j
Latent site-level effect on the response scale, \(\tau + \sigma_\tau\,z_j\).
- tau_j_hat
Observed site-level estimate \(\widehat{\tau}_j\).
- se_j, se2_j
Site-level SE and sampling variance \(\widehat{se}_j^2 = \kappa / n_j\).
- n_j
Site size from the Layer 2 margin.
Plus the following attributes:
- design
The locked multisitedgp_design object.
- diagnostics
Group A / B / C / D diagnostics: I_hat, R_hat, realized Spearman and Pearson correlations (residual and marginal), sigma_tau realized vs. target, and dependence and observation diagnostics; see compute_I and informativeness.
- provenance
Package version, R version, platform, resolved seed, canonical_hash, design_hash, and the call expression.
- multisitedgp_version, paradigm
Convenience copies for quick attribute lookup.
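The note on site_index can be illustrated with a small base-R sketch. It does not call the package; the toy table, the site sizes, and the value of kappa are made up for illustration, and a random permutation stands in for whatever permutation Layer 3 actually chooses. The point is only that moving (se_j, se2_j, n_j) together keeps the three columns mutually consistent and leaves site_index untouched.

```r
# Illustrative only: a toy site table that borrows the documented column names.
set.seed(1L)
J     <- 6L
n_j   <- c(20L, 35L, 50L, 80L, 120L, 200L)  # made-up site sizes
kappa <- 4                                   # made-up variance constant

sites <- data.frame(
  site_index = seq_len(J),
  tau_j      = rnorm(J, mean = 0, sd = 0.2),
  n_j        = n_j,
  se2_j      = kappa / n_j,
  se_j       = sqrt(kappa / n_j)
)

# Apply ONE permutation to the whole precision triple so that the three
# columns stay consistent with each other; site_index and tau_j do not move.
perm    <- sample(J)
triple  <- c("n_j", "se2_j", "se_j")
aligned <- sites
aligned[triple] <- sites[perm, triple]

# The identifier is unchanged, the site-size margin is preserved as a set,
# and se2_j still equals kappa / n_j row by row.
identical(aligned$site_index, sites$site_index)
identical(sort(aligned$n_j), sort(sites$n_j))
isTRUE(all.equal(aligned$se2_j, kappa / aligned$n_j))
```

All three checks should evaluate to TRUE for any permutation, which is why joining results back to external site metadata by site_index remains safe after alignment.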
Details
The simulation runs four generative layers in order:
- Layer 1: latent effects (gen_effects)
Draws standardized site effects \(z_j\) from one of eight built-in \(G\) distributions and rescales to \(\tau_j = \tau + \sigma_\tau\,z_j\).
- Layer 2: site-level precision (gen_site_sizes)
Builds the per-site sampling variance \(\widehat{se}_j^2 = \kappa / n_j\) from generated site sizes \(n_j\).
- Layer 3: precision dependence (align_rank_corr, align_copula_corr, align_hybrid_corr)
Optionally aligns \(\widehat{se}_j^2\) against \(\tau_j\) to a target Spearman correlation, preserving both marginals exactly.
- Layer 4: observation draws (gen_observations)
Draws the observed estimate \(\widehat{\tau}_j \sim \mathcal{N}(\tau_j,\, \widehat{se}_j^2)\).
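The four layers can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the design values (J, tau, sigma_tau, kappa, rho), the Gaussian choice for \(G\), the shifted-Poisson site-size margin, and the noisy-score alignment in Layer 3 are all hypothetical stand-ins. In particular, the Layer 3 step here only approximates a target rank correlation, whereas the package's alignment functions may work differently.

```r
set.seed(1L)

# Hypothetical design values, chosen only for illustration.
J         <- 200L  # number of sites
tau       <- 0     # grand mean effect
sigma_tau <- 0.2   # between-site SD of latent effects
kappa     <- 4     # constant in se2_j = kappa / n_j
rho       <- 0.6   # latent correlation used to induce dependence

# Layer 1: standardized effects z_j (Gaussian G assumed) rescaled to tau_j.
z_j   <- rnorm(J)
tau_j <- tau + sigma_tau * z_j

# Layer 2: site sizes from an assumed margin, then induced sampling variances.
n_j   <- 10L + rpois(J, lambda = 40)
se2_j <- kappa / n_j

# Layer 3: build a noisy score correlated with z_j and hand out the sorted
# se2_j values by score rank. Only a permutation is applied, so the margins
# of n_j and se2_j are preserved exactly.
score <- rho * z_j + sqrt(1 - rho^2) * rnorm(J)
perm  <- order(se2_j)[rank(score, ties.method = "first")]
n_j   <- n_j[perm]
se2_j <- se2_j[perm]
se_j  <- sqrt(se2_j)

# Layer 4: observed estimates drawn around the latent effects.
tau_j_hat <- rnorm(J, mean = tau_j, sd = se_j)

sim <- data.frame(site_index = seq_len(J), z_j, tau_j, tau_j_hat, se_j, se2_j, n_j)

# Realized rank correlation between latent effects and sampling variances.
cor(sim$tau_j, sim$se2_j, method = "spearman")
```

Because Layer 3 only reorders values, setting rho to 0 in this sketch changes which site receives which precision but not the set of site sizes drawn in Layer 2.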
The multisitedgp_design is validated and frozen at entry, then attached
to the returned multisitedgp_data alongside the diagnostics and
provenance attributes. The canonical hash is stored at
attr(x, "provenance")$canonical_hash (not as a top-level attribute) and
is the cross-machine reproducibility identifier — two machines producing
the same hash will have generated bit-identical site-level tibbles.
For a workflow walkthrough see the Getting started vignette. For the formal two-stage DGP specification, see The two-stage DGP — formal specification.
RNG policy
If seed is NULL, the pipeline runs under the caller's active RNG state
and consumes the ordinary Layer 1/2/3/4 draws. No seed is manufactured. If
seed is a single integer, the full pipeline is wrapped in
with_seed, so the caller's global RNG state is
restored on exit. The resolved seed is recorded in the provenance
attribute.
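The save-and-restore behaviour described above can be sketched in base R. The helper name with_seed_sketch is hypothetical and is not the with_seed used by the package; it only illustrates the general pattern of running an expression under a fixed seed while leaving the caller's global RNG state as it was.

```r
# Hypothetical helper: evaluate `expr` under `seed`, then restore the
# caller's global RNG state on exit.
with_seed_sketch <- function(seed, expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  force(expr)  # lazy evaluation: the expression runs only after set.seed()
}

# Same seed, same draws.
draw_a <- with_seed_sketch(1L, rnorm(3))
draw_b <- with_seed_sketch(1L, rnorm(3))
identical(draw_a, draw_b)

# The caller's RNG stream is unaffected by the seeded call in between.
set.seed(99L)
invisible(with_seed_sketch(1L, rnorm(3)))
after_wrapped <- runif(1)
set.seed(99L)
without_wrapped <- runif(1)
identical(after_wrapped, without_wrapped)
```

Both comparisons should return TRUE: the first reflects the reproducibility of seeded runs, the second the restoration of the global state on exit.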
References
Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.
See also
sim_meta for the direct-precision (Paradigm B) sister
wrapper that takes precision targets in place of a site-size margin;
multisitedgp_design for explicit design construction and
validation;
the presets family for defensible starting designs;
design_grid for scenario-grid sweeps;
gen_effects, gen_site_sizes,
align_hybrid_corr, gen_observations for the
four layers exposed individually;
the Getting started vignette.
Other family-wrappers:
sim_meta()
Examples
# Minimal usage: a defensible preset, one call, read realized informativeness.
dat <- sim_multisite(preset_education_modest(), seed = 1L)
attr(dat, "diagnostics")$I_hat
#> [1] 0.3028032
# Full diagnostic report — realized vs. intended on every dimension.
summary(dat)
#> multisiteDGP simulation diagnostics
#> ------------------------------------------------------------
#> A. Realized vs Intended
#> I (informativeness): 0.303 (target N/A) N/A [no target]
#> R (SE heterogeneity): 10.167 (target N/A) N/A [no target]
#> sigma_tau: 0.166 (target 0.200) FAIL [rel=-16.9%]
#> GM(se^2): 0.092 (target N/A) N/A [no target]
#>
#> B. Dependence
#> rank_corr residual: 0.254 (target 0.000) PASS [delta=0.254]
#> rank_corr marginal: 0.254 (target N/A) N/A [residual target rows only; no finite target; status not assigned]
#> pearson_corr residual: 0.375 (target 0.000) FAIL [delta=0.375]
#> pearson_corr marginal: 0.375 (target N/A) N/A [residual target rows only; no finite target; status not assigned]
#>
#> C. G shape fit
#> KS distance D_J: 0.140 (target 0.000) PASS [p=0.717]
#> Bhattacharyya BC: 0.801 (target 1.000) WARN [rel=-19.9%]
#> Q-Q residual: 0.731 (target 0.000) N/A [delta=0.731]
#>
#> D. Operational feasibility
#> mean shrinkage S: 0.314 (target N/A) PASS [no target]
#> avg MOE (95%): 0.617 (target N/A) WARN [no target]
#> feasibility_index: 15.693 (target N/A) WARN [no target]
#> ------------------------------------------------------------
#> Overall: 3 PASS, 3 WARN, 2 FAIL.
#> Provenance: multisiteDGP 0.1.1 | paradigm=site_size | seed=1 | canonical_hash=b36023f5aa158255 | design_hash=788d326c95d2df04 | hash_algo=xxhash64 | R=4.6.0 | hooks=none
# Provenance travels with the object for reproducibility audits.
attr(dat, "provenance")$canonical_hash
#> [1] "b36023f5aa158255"
if (FALSE) { # \dontrun{
# Hand off to a meta-analytic estimator (requires {metafor}).
# `as_metafor()` renames the canonical columns to metafor's (yi, vi, sei).
metafor_obj <- as_metafor(dat)
metafor::rma(yi = yi, vi = vi, data = metafor_obj)
} # }