Simulate a multisite trial data-generating process

sim_multisite() is the unified interface for the four-layer multisite-trial data-generating process. Given a multisite design — site count, per-site sizes, latent-effect distribution, and optional precision dependence — it composes the latent-effects, site-size-margin, dependence-alignment, and observation layers in one call and returns a multisitedgp_data tibble with diagnostics, provenance, and a canonical hash. This is the site-size-driven path (Paradigm A in the blueprint), in which sampling variances are induced from a site-size margin $n_j$; the sister sim_meta covers the direct-precision path (Paradigm B), in which $\widehat{se}_j^2$ is specified directly.

Usage

sim_multisite(design = NULL, ..., seed = NULL)

Arguments

design: Optional multisitedgp_design. If NULL, ... is forwarded to multisitedgp_design with paradigm = "site_size". Construct a design once with multisitedgp_design or a preset when reusing it across multiple calls or a design_grid sweep.
...: Flat design arguments used only when design = NULL. See multisitedgp_design for the full parameter list. Note that paradigm cannot be passed here — the wrapper locks paradigm = "site_size"; use sim_meta for direct-precision designs.
seed: Optional integer seed override. When supplied, replaces design$seed and gives bit-identical reruns. Use a small integer (e.g. 1L) for examples; use a 9-digit integer in production for cross-run uniqueness.

Value

A multisitedgp_data tibble with one row per site and columns:

site_index: Integer site identifier $j = 1, \ldots, J$ — preserved through the pipeline (Layer 3 permutes the (se_j, se2_j, n_j) triple, never site_index).
z_j: Standardized residual effect (mean 0, variance 1).
tau_j: Latent site-level effect on the response scale, $\tau + \sigma_\tau\,z_j$.
tau_j_hat: Observed site-level estimate $\widehat{\tau}_j$.
se_j, se2_j: Site-level SE and sampling variance $\widehat{se}_j^2 = \kappa / n_j$.
n_j: Site size from the Layer 2 margin.

Plus the following attributes:

design: The locked multisitedgp_design object.
diagnostics: Group A / B / C / D diagnostics — I_hat, R_hat, realized Spearman and Pearson correlations (residual and marginal), sigma_tau realized vs. target, dependence and observation diagnostics; see compute_I and informativeness.
provenance: Package version, R version, platform, resolved seed, canonical_hash, design_hash, and the call expression.
multisitedgp_version, paradigm: Convenience copies for quick attribute lookup.

Details

The simulation runs four generative layers in order:

Layer 1 — latent effects (gen_effects): Draws standardized site effects $z_j$ from one of eight built-in $G$ distributions and rescales to $\tau_j = \tau + \sigma_\tau\,z_j$.
Layer 2 — site-level precision (gen_site_sizes): Builds the per-site sampling variance $\widehat{se}_j^2 = \kappa / n_j$ from generated site sizes $n_j$.
Layer 3 — precision dependence (align_rank_corr, align_copula_corr, align_hybrid_corr): Optionally aligns $\widehat{se}_j^2$ against $\tau_j$ to a target Spearman correlation, preserving both marginals exactly.
Layer 4 — observation draws (gen_observations): Draws the observed estimate $\widehat{\tau}_j \sim \mathcal{N}(\tau_j,\, \widehat{se}_j^2)$.
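
The four layers can be mimicked in a few lines of base R. This is a rough sketch for intuition only, not the package's implementation: the parameter values, the Gaussian choice of $G$, the shifted-Poisson site sizes, and the `rho` knob with its noisy-score alignment are all assumptions made for illustration, and the realized Spearman correlation is not tuned to an exact target here.

```r
# Illustrative sketch only: base R, not the package's internals.
set.seed(1L)

J         <- 50L   # number of sites (assumed value)
tau       <- 0     # grand mean effect (assumed value)
sigma_tau <- 0.2   # between-site SD (assumed value)
kappa     <- 4     # constant in se2_j = kappa / n_j (assumed value)
rho       <- 0.3   # hypothetical dependence knob for the Layer 3 sketch

# Layer 1: standardized effects z_j, rescaled to tau_j (Gaussian G assumed).
z_j   <- rnorm(J)
tau_j <- tau + sigma_tau * z_j

# Layer 2: site sizes n_j (shifted Poisson assumed), then sampling variances.
n_j   <- 10L + rpois(J, lambda = 40)
se2_j <- kappa / n_j

# Layer 3: crude rank alignment. Permute the precision columns so their ranks
# follow a noisy copy of z_j. Only a permutation is applied, so the marginal
# of n_j (and se2_j) is preserved, and tau_j and the site order are untouched.
score <- rho * z_j + sqrt(1 - rho^2) * rnorm(J)
perm  <- order(se2_j)[rank(score, ties.method = "first")]
n_j   <- n_j[perm]
se2_j <- se2_j[perm]
se_j  <- sqrt(se2_j)

# Layer 4: observed estimates drawn around the latent effects.
tau_j_hat <- rnorm(J, mean = tau_j, sd = se_j)

sim <- data.frame(site_index = seq_len(J), z_j, tau_j, tau_j_hat,
                  se_j, se2_j, n_j)
head(sim, 3)
cor(sim$tau_j, sim$se2_j, method = "spearman")
```

Because Layer 3 only reorders the (se_j, se2_j, n_j) columns, the identity $\widehat{se}_j^2 = \kappa / n_j$ still holds row by row after alignment.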

The multisitedgp_design is validated and frozen at entry, then attached to the returned multisitedgp_data alongside the diagnostics and provenance attributes. The canonical hash is stored at attr(x, "provenance")$canonical_hash (not as a top-level attribute) and is the cross-machine reproducibility identifier — two machines producing the same hash will have generated bit-identical site-level tibbles.
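
A reproducibility audit along these lines might compare the hashes of two runs. The sketch below assumes the multisiteDGP package is installed and attached, and uses only calls that appear in the Examples; per the seed argument's description, two runs with the same design and seed are expected to agree.

```r
# Assumes multisiteDGP is installed; package name taken from the summary output.
library(multisiteDGP)

design <- preset_education_modest()

# Same design, same seed: the documentation promises bit-identical reruns.
run_a <- sim_multisite(design, seed = 1L)
run_b <- sim_multisite(design, seed = 1L)

hash_a <- attr(run_a, "provenance")$canonical_hash
hash_b <- attr(run_b, "provenance")$canonical_hash

# Compare the hashes rather than the full objects, since provenance may also
# record run-specific details such as the call expression.
identical(hash_a, hash_b)
```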

For a workflow walkthrough see the Getting started vignette. For the formal two-stage DGP specification, see The two-stage DGP — formal specification.

RNG policy

If seed is NULL, the pipeline runs under the caller's active RNG state and consumes the ordinary Layer 1/2/3/4 draws. No seed is manufactured. If seed is a single integer, the full pipeline is wrapped in with_seed, so the caller's global RNG state is restored on exit. The resolved seed is recorded in the provenance attribute.
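
The seed-scoping behavior described above can be illustrated in base R. The helper `run_with_local_seed()` below is hypothetical and written only for this illustration; it is not the package's with_seed, though it follows the same idea of saving the caller's RNG state, seeding, evaluating, and restoring on exit.

```r
# Hypothetical helper, for illustration only; not the package's with_seed.
run_with_local_seed <- function(seed, code) {
  # Make sure a global RNG state exists so there is something to restore.
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    invisible(runif(1))
  }
  old_state <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  on.exit(assign(".Random.seed", old_state, envir = globalenv()), add = TRUE)
  set.seed(seed)
  code  # lazily evaluated: forced here, after the seed has been set
}

# Reference: what the caller's next draw would be with no seeded call between.
set.seed(42L)
expected_next <- runif(1)

# Same starting state, but with a seeded call in between.
set.seed(42L)
draw_a      <- run_with_local_seed(1L, rnorm(3))
actual_next <- runif(1)

draw_b <- run_with_local_seed(1L, rnorm(3))

identical(draw_a, draw_b)               # same seed, same draws
identical(expected_next, actual_next)   # caller's stream was not advanced
```

With seed = NULL the opposite holds by design: the draws come from the caller's stream, so results depend on whatever RNG state is active at the call.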

References

Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.

Examples

# Minimal usage: a defensible preset, one call, read realized informativeness.
dat <- sim_multisite(preset_education_modest(), seed = 1L)
attr(dat, "diagnostics")$I_hat
#> [1] 0.3028032

# Full diagnostic report — realized vs. intended on every dimension.
summary(dat)
#> multisiteDGP simulation diagnostics
#> ------------------------------------------------------------
#> A. Realized vs Intended
#>    I (informativeness):         0.303  (target N/A)  N/A   [no target]
#>    R (SE heterogeneity):       10.167  (target N/A)  N/A   [no target]
#>    sigma_tau:                   0.166  (target 0.200)  FAIL  [rel=-16.9%]
#>    GM(se^2):                    0.092  (target N/A)  N/A   [no target]
#> 
#> B. Dependence
#>    rank_corr residual:          0.254  (target 0.000)  PASS  [delta=0.254]
#>    rank_corr marginal:          0.254  (target N/A)  N/A   [residual target rows only; no finite target; status not assigned]
#>    pearson_corr residual:       0.375  (target 0.000)  FAIL  [delta=0.375]
#>    pearson_corr marginal:       0.375  (target N/A)  N/A   [residual target rows only; no finite target; status not assigned]
#> 
#> C. G shape fit
#>    KS distance D_J:             0.140  (target 0.000)  PASS  [p=0.717]
#>    Bhattacharyya BC:            0.801  (target 1.000)  WARN  [rel=-19.9%]
#>    Q-Q residual:                0.731  (target 0.000)  N/A   [delta=0.731]
#> 
#> D. Operational feasibility
#>    mean shrinkage S:            0.314  (target N/A)  PASS  [no target]
#>    avg MOE (95%):               0.617  (target N/A)  WARN  [no target]
#>    feasibility_index:          15.693  (target N/A)  WARN  [no target]
#> ------------------------------------------------------------
#> Overall: 3 PASS, 3 WARN, 2 FAIL.
#> Provenance: multisiteDGP 0.1.1 | paradigm=site_size | seed=1 | canonical_hash=b36023f5aa158255 | design_hash=788d326c95d2df04 | hash_algo=xxhash64 | R=4.6.0 | hooks=none

# Provenance travels with the object for reproducibility audits.
attr(dat, "provenance")$canonical_hash
#> [1] "b36023f5aa158255"

if (FALSE) { # \dontrun{
  # Hand off to a meta-analytic estimator (requires {metafor}).
  # `as_metafor()` renames the canonical columns to metafor's (yi, vi, sei).
  metafor_obj <- as_metafor(dat)
  metafor::rma(yi = yi, vi = vi, data = metafor_obj)
} # }