
sim_multisite() is the unified interface for the four-layer multisite-trial data-generating process. Given a multisite design — site count, per-site sizes, latent-effect distribution, and optional precision dependence — it composes the latent-effects, site-size-margin, dependence-alignment, and observation layers in one call and returns a multisitedgp_data tibble with diagnostics, provenance, and a canonical hash. This is the site-size-driven path (Paradigm A in the blueprint), in which sampling variances are induced from a site-size margin \(n_j\); the sister sim_meta covers the direct-precision path (Paradigm B), in which \(\widehat{se}_j^2\) is specified directly.

Usage

sim_multisite(design = NULL, ..., seed = NULL)

Arguments

design

Optional multisitedgp_design. If NULL, ... is forwarded to multisitedgp_design with paradigm = "site_size". Construct a design once, with multisitedgp_design or one of the presets, when reusing it across multiple calls or a design_grid sweep.

...

Flat design arguments used only when design = NULL. See multisitedgp_design for the full parameter list. Note that paradigm cannot be passed here — the wrapper locks paradigm = "site_size"; use sim_meta for direct-precision designs.

seed

Optional integer seed override. When supplied, replaces design$seed and gives bit-identical reruns. Use a small integer (e.g. 1L) for examples; use a 9-digit integer in production for cross-run uniqueness.

Value

A multisitedgp_data tibble with one row per site and columns:

site_index

Integer site identifier \(j = 1, \ldots, J\) — preserved through the pipeline (Layer 3 permutes the (se_j, se2_j, n_j) triple, never site_index).

z_j

Standardized residual effect (mean 0, variance 1).

tau_j

Latent site-level effect on the response scale, \(\tau + \sigma_\tau\,z_j\).

tau_j_hat

Observed site-level estimate \(\widehat{\tau}_j\).

se_j, se2_j

Site-level SE and sampling variance \(\widehat{se}_j^2 = \kappa / n_j\).

n_j

Site size from the Layer 2 margin.

Plus the following attributes:

design

The locked multisitedgp_design object.

diagnostics

Group A / B / C / D diagnostics — I_hat, R_hat, realized Spearman and Pearson correlations (residual and marginal), sigma_tau realized vs. target, dependence and observation diagnostics; see compute_I and informativeness.

provenance

Package version, R version, platform, resolved seed, canonical_hash, design_hash, and the call expression.

multisitedgp_version, paradigm

Convenience copies for quick attribute lookup.
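As a rough guide to reading I_hat, the sketch below computes informativeness as the target between-site variance divided by that variance plus the geometric mean of the sampling variances. This formula is an assumption, not taken from the package source; it is consistent with the rounded figures in the summary() report under Examples (sigma_tau target 0.200, GM(se^2) 0.092, I 0.303). The helper name informativeness_sketch is hypothetical; compute_I remains the authoritative definition.

```r
# Hypothetical helper, not the package's compute_I(): share of total variance
# attributable to true between-site heterogeneity (assumed formula).
informativeness_sketch <- function(se2, sigma_tau) {
  gm_se2 <- exp(mean(log(se2)))  # geometric mean of the sampling variances
  sigma_tau^2 / (sigma_tau^2 + gm_se2)
}

# Rounded values read off the summary() report in Examples.
i_check <- informativeness_sketch(se2 = 0.092, sigma_tau = 0.2)
round(i_check, 3)
#> [1] 0.303
```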

Details

The simulation runs four generative layers in order:

Layer 1 — latent effects (gen_effects)

Draws standardized site effects \(z_j\) from one of eight built-in \(G\) distributions and rescales to \(\tau_j = \tau + \sigma_\tau\,z_j\).

Layer 2 — site-level precision (gen_site_sizes)

Builds the per-site sampling variance \(\widehat{se}_j^2 = \kappa / n_j\) from generated site sizes \(n_j\).

Layer 3 — precision dependence (align_rank_corr, align_copula_corr, align_hybrid_corr)

Optionally aligns \(\widehat{se}_j^2\) against \(\tau_j\) to a target Spearman correlation, preserving both marginals exactly.

Layer 4 — observation draws (gen_observations)

Draws the observed estimate \(\widehat{\tau}_j \sim \mathcal{N}(\tau_j,\, \widehat{se}_j^2)\).
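The four layers can be mimicked in a few lines of base R. This is a minimal sketch under stated assumptions, not the package's implementation: it uses a Gaussian \(G\), a Poisson-based site-size margin, and a simple score-based permutation in place of the package's alignment routines. Every numeric value (J, tau, sigma_tau, kappa, rho) is illustrative and not a package default.

```r
set.seed(1L)

# Illustrative values only; none of these are package defaults.
J         <- 50L   # number of sites
tau       <- 0     # grand mean effect
sigma_tau <- 0.2   # between-site SD
kappa     <- 4     # constant in se2_j = kappa / n_j
rho       <- 0.3   # rough strength of the induced dependence

# Layer 1: standardized effects (Gaussian G assumed), then rescale.
z_j   <- rnorm(J)
tau_j <- tau + sigma_tau * z_j

# Layer 2: site sizes (assumed margin), then induced sampling variances.
n_j   <- 20L + rpois(J, lambda = 60)
se2_j <- kappa / n_j

# Layer 3 stand-in: permute the (n_j, se2_j) pairs so that larger variances
# tend to sit with larger noisy scores of z_j. A permutation changes the
# pairing with tau_j but leaves both margins untouched.
score  <- rho * z_j + sqrt(1 - rho^2) * rnorm(J)
perm   <- order(se2_j)[rank(score, ties.method = "first")]
n_al   <- n_j[perm]
se2_al <- se2_j[perm]

# Layer 4: observed estimates centred on the latent effects.
tau_j_hat <- rnorm(J, mean = tau_j, sd = sqrt(se2_al))

sim <- data.frame(
  site_index = seq_len(J), z_j, tau_j, tau_j_hat,
  se_j = sqrt(se2_al), se2_j = se2_al, n_j = n_al
)
cor(sim$tau_j, sim$se2_j, method = "spearman")  # realized rank correlation
```

The permutation step only approximates a target rank correlation; the package's align_* functions are documented as hitting a target Spearman correlation, which this stand-in does not attempt.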

The multisitedgp_design is validated and frozen at entry, then attached to the returned multisitedgp_data alongside the diagnostics and provenance attributes. The canonical hash is stored at attr(x, "provenance")$canonical_hash (not as a top-level attribute) and is the cross-machine reproducibility identifier — two machines producing the same hash will have generated bit-identical site-level tibbles.
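A quick way to exercise the reproducibility claim is to run the same design twice with the same seed and compare the stored hashes. The sketch below uses only calls that appear on this page and is guarded so that it does nothing when the package is not installed; the expectation that the two hashes match follows from the description above and is not verified output.

```r
if (requireNamespace("multisiteDGP", quietly = TRUE)) {
  library(multisiteDGP)

  # One design, reused for both runs.
  design <- preset_education_modest()
  run_a  <- sim_multisite(design, seed = 1L)
  run_b  <- sim_multisite(design, seed = 1L)

  # Same design and same seed: the two hashes are expected to agree.
  hash_a <- attr(run_a, "provenance")$canonical_hash
  hash_b <- attr(run_b, "provenance")$canonical_hash
  print(identical(hash_a, hash_b))
}
```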

For a workflow walkthrough see the Getting started vignette. For the formal two-stage DGP specification, see The two-stage DGP — formal specification.

RNG policy

If seed is NULL, the pipeline runs under the caller's active RNG state and consumes the ordinary Layer 1/2/3/4 draws. No seed is manufactured. If seed is a single integer, the full pipeline is wrapped in with_seed, so the caller's global RNG state is restored on exit. The resolved seed is recorded in the provenance attribute.
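The restore-on-exit behaviour can be illustrated in base R. The helper with_seed_sketch below is hypothetical and is not the function the package calls; it only shows the pattern of saving the caller's RNG state, seeding, evaluating, and putting the saved state back.

```r
# Hypothetical stand-in for a with_seed-style wrapper (base R only).
with_seed_sketch <- function(seed, expr) {
  g <- globalenv()
  had_seed <- exists(".Random.seed", envir = g, inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = g) else NULL
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = g)  # restore caller's state
    } else if (exists(".Random.seed", envir = g, inherits = FALSE)) {
      rm(list = ".Random.seed", envir = g)         # caller had no state yet
    }
  }, add = TRUE)
  set.seed(seed)
  force(expr)  # lazily evaluated, so the draws happen after set.seed()
}

set.seed(42L)
draws_1 <- with_seed_sketch(1L, rnorm(3))
after_1 <- runif(1)  # caller's stream, untouched by the seeded call

set.seed(42L)
after_2 <- runif(1)  # same stream with no seeded call in between
draws_2 <- with_seed_sketch(1L, rnorm(3))

identical(draws_1, draws_2)
#> [1] TRUE
identical(after_1, after_2)
#> [1] TRUE
```

With seed = NULL no such wrapping applies, so consecutive calls advance the caller's stream and generally return different data.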

References

Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.

See also

sim_meta for the direct-precision (Paradigm B) sister wrapper that takes precision targets in place of a site-size margin; multisitedgp_design for explicit design construction and validation; the presets family for defensible starting designs; design_grid for scenario-grid sweeps; gen_effects, gen_site_sizes, align_hybrid_corr, gen_observations for the four layers exposed individually; the Getting started vignette.

Other family-wrappers: sim_meta()

Examples

# Minimal usage: a defensible preset, one call, read realized informativeness.
dat <- sim_multisite(preset_education_modest(), seed = 1L)
attr(dat, "diagnostics")$I_hat
#> [1] 0.3028032

# Full diagnostic report — realized vs. intended on every dimension.
summary(dat)
#> multisiteDGP simulation diagnostics
#> ------------------------------------------------------------
#> A. Realized vs Intended
#>    I (informativeness):         0.303  (target N/A)  N/A   [no target]
#>    R (SE heterogeneity):       10.167  (target N/A)  N/A   [no target]
#>    sigma_tau:                   0.166  (target 0.200)  FAIL  [rel=-16.9%]
#>    GM(se^2):                    0.092  (target N/A)  N/A   [no target]
#> 
#> B. Dependence
#>    rank_corr residual:          0.254  (target 0.000)  PASS  [delta=0.254]
#>    rank_corr marginal:          0.254  (target N/A)  N/A   [residual target rows only; no finite target; status not assigned]
#>    pearson_corr residual:       0.375  (target 0.000)  FAIL  [delta=0.375]
#>    pearson_corr marginal:       0.375  (target N/A)  N/A   [residual target rows only; no finite target; status not assigned]
#> 
#> C. G shape fit
#>    KS distance D_J:             0.140  (target 0.000)  PASS  [p=0.717]
#>    Bhattacharyya BC:            0.801  (target 1.000)  WARN  [rel=-19.9%]
#>    Q-Q residual:                0.731  (target 0.000)  N/A   [delta=0.731]
#> 
#> D. Operational feasibility
#>    mean shrinkage S:            0.314  (target N/A)  PASS  [no target]
#>    avg MOE (95%):               0.617  (target N/A)  WARN  [no target]
#>    feasibility_index:          15.693  (target N/A)  WARN  [no target]
#> ------------------------------------------------------------
#> Overall: 3 PASS, 3 WARN, 2 FAIL.
#> Provenance: multisiteDGP 0.1.1 | paradigm=site_size | seed=1 | canonical_hash=b36023f5aa158255 | design_hash=788d326c95d2df04 | hash_algo=xxhash64 | R=4.6.0 | hooks=none

# Provenance travels with the object for reproducibility audits.
attr(dat, "provenance")$canonical_hash
#> [1] "b36023f5aa158255"

if (FALSE) { # \dontrun{
  # Hand off to a meta-analytic estimator (requires {metafor}).
  # `as_metafor()` renames the canonical columns to metafor's (yi, vi, sei).
  metafor_obj <- as_metafor(dat)
  metafor::rma(yi = yi, vi = vi, data = metafor_obj)
} # }