Generate site sizes and sampling variances

Draw J integer site sizes \(n_j\) from a target mean and coefficient of variation, compute the per-site Neyman sampling variance \(\widehat{se}_j^2 = \kappa / n_j\), and append the n_j, se2_j, and se_j columns to an upstream Layer 1 frame. This is the Layer 2 margin generator for the site-size-driven path (Paradigm A in the blueprint). Call it directly when composing the four layers manually, or let sim_multisite call it for you.

Usage

gen_site_sizes(
  upstream,
  J,
  nj_mean = 50,
  cv = 0.5,
  nj_min = 5L,
  p = 0.5,
  R2 = 0,
  var_outcome = 1,
  engine = c("A2_modern", "A1_legacy")
)

Arguments

upstream: Data frame with exactly J rows. Typically the output of gen_effects; must contain the canonical Layer 1 columns site_index, z_j, tau_j. Layer 2 columns (n_j, se_j, se2_j) must NOT be present yet.
J: Integer. Number of sites — must equal nrow(upstream).
nj_mean: Numeric (\(\ge \mathrm{nj\_min}\)). Target site-size mean on the engine scale. Default 50. Typical applied range: 20–500.
cv: Numeric (\(\ge 0\)). Target site-size coefficient of variation. Default 0.5. Use cv = 0 for equal-size sites; cv = 0.5 for the JEBS reference range. A larger cv produces more heterogeneous site sizes.
nj_min: Integer (\(\ge 1\)). Lower bound for public site sizes. Default 5. The engine output is floored at this bound.
p: Numeric in (0, 1). Treatment-assignment proportion. Default 0.5 (balanced). Affects \(\kappa\) through Neyman allocation.
R2: Numeric in [0, 1). Covariate-explained variance share at the site level. Default 0. Decreases \(\kappa\) and improves precision through the multiplier \(1 - R^2\).
var_outcome: Numeric (> 0). Outcome variance. Default 1. Scales \(\kappa\) linearly.
engine: Character. "A2_modern" (default — recommended) or "A1_legacy" (JEBS bit-parity reproduction only).

Value

The upstream tibble with three appended columns: n_j (integer site size), se2_j (numeric sampling variance, \(\kappa / n_j\)), and se_j (numeric standard error, \(\sqrt{\kappa / n_j}\)). Two attributes are attached: engine (the resolved engine name) and kappa (the Neyman precision constant).

Details

Engine choice. Two engines back the site-size draw:

"A2_modern" (default — recommended for new work): Lower-truncated Gamma on the continuous target scale, then stochastic rounding to integer n_j. Preserves the target mean in expectation and matches cv exactly on the underlying continuous draw.
"A1_legacy": The JEBS paper's censor-then-round procedure. Preserved for bit-identical reproduction of the JEBS reference design and its replication fixtures. Can inflate the empirical mean near nj_min through censoring; not recommended for new work.

Pick A2 unless you are explicitly reproducing a JEBS fixture. A1 is also restricted: validate_multisitedgp_design refuses to combine A1 with non-trivial precision dependence (dependence != "none").
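The mean inflation attributed to censoring above can be illustrated with a small base-R sketch. This is not the package's A1 code: it assumes a Gamma parameterised as shape = 1 / cv^2 and scale = nj_mean * cv^2 (which has mean nj_mean and coefficient of variation cv), and it uses a deliberately small nj_mean so that the floor at nj_min matters.

```r
# Illustrative only: the Gamma parameterisation below is an assumption,
# chosen so that the draws have mean nj_mean and CV cv.
set.seed(42)
nj_mean <- 10
cv <- 0.5
nj_min <- 5

shape <- 1 / cv^2
scale <- nj_mean * cv^2
raw <- rgamma(1e5, shape = shape, scale = scale)

# Censor-then-round: lift every draw below nj_min up to nj_min, then round.
censored <- as.integer(round(pmax(raw, nj_min)))

mean_raw <- mean(raw)                 # close to nj_mean
mean_censored <- mean(censored)       # larger than mean_raw because of the floor
share_censored <- mean(raw < nj_min)  # fraction of draws that hit the floor
```

Under these assumed values roughly 14% of the draws fall below nj_min, and lifting them to the floor raises the mean by roughly 0.2. The effect shrinks as nj_mean moves away from nj_min.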

Sampling variance. The per-site Neyman variance is \(\kappa / n_j\) with \(\kappa = \mathrm{var\_outcome} \cdot (1 - R^2) / (p (1 - p))\), the standard Neyman-allocation precision constant. Pass p, R2, and var_outcome to control \(\kappa\) explicitly; the defaults (p = 0.5, R2 = 0, var_outcome = 1) give \(\kappa = 4\), the baseline used in the JEBS paper.
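As a worked check of the formula, the constant can be computed by hand in base R. The helper name neyman_kappa is hypothetical and is not a package export; it only restates the expression given above.

```r
# Neyman precision constant, restating the documented formula.
# `neyman_kappa` is a hypothetical helper name used for illustration.
neyman_kappa <- function(p = 0.5, R2 = 0, var_outcome = 1) {
  stopifnot(p > 0, p < 1, R2 >= 0, R2 < 1, var_outcome > 0)
  var_outcome * (1 - R2) / (p * (1 - p))
}

kappa_default <- neyman_kappa()                    # 1 * 1 / 0.25 = 4
kappa_adjusted <- neyman_kappa(p = 0.3, R2 = 0.4)  # 0.6 / 0.21

# Sampling variance and standard error for a few site sizes.
n_j <- c(10L, 50L, 109L)
se2_j <- kappa_default / n_j
se_j <- sqrt(se2_j)
```

Moving p away from 0.5 increases \(\kappa\), while a larger R2 decreases it, so the two adjustments in kappa_adjusted pull in opposite directions.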

For the formal Paradigm A vs Paradigm B contrast and the engine derivation, see the Margin and SE models — site-size and direct-precision paths vignette.

RNG policy

Stochastic rounding (A2 only) consumes one runif() draw for each non-integer engine output. All-integer engine output, including the cv = 0 deterministic path, consumes no rounding RNG. The engine itself (Gamma draw under A2 or A1) consumes the usual rgamma() stream.
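A minimal sketch of stochastic rounding with this RNG behaviour is shown below. It is an illustration of the general technique, not the package's internal code, and stochastic_round is a hypothetical name: each non-integer value is rounded up with probability equal to its fractional part, and uniforms are drawn only for the non-integer entries.

```r
# Stochastic rounding sketch: round up with probability equal to the
# fractional part, drawing one uniform per non-integer entry only.
# `stochastic_round` is a hypothetical name used for illustration.
stochastic_round <- function(x) {
  lower <- floor(x)
  frac <- x - lower
  needs_draw <- frac > 0
  up <- logical(length(x))
  if (any(needs_draw)) {
    up[needs_draw] <- runif(sum(needs_draw)) < frac[needs_draw]
  }
  as.integer(lower + up)
}

set.seed(1)
state_before <- .Random.seed
all_integer <- stochastic_round(c(40, 40, 40))   # no runif() call is made
rng_untouched <- identical(state_before, .Random.seed)

# Non-integer input: the rounded values average out to the input value.
rounded <- stochastic_round(rep(40.25, 1e5))
mean_rounded <- mean(rounded)
```

Because the rounding is unbiased, mean_rounded stays close to 40.25, which is the sense in which this kind of rounding preserves a target mean in expectation.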

References

Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.

Examples

# Compose Layer 1 + Layer 2 manually.
effects <- gen_effects_gaussian(J = 10L)
gen_site_sizes(effects, J = 10L, nj_mean = 40, cv = 0.2)
#> # A tibble: 10 × 6
#>    site_index    z_j   tau_j   n_j  se2_j  se_j
#>         <int>  <dbl>   <dbl> <int>  <dbl> <dbl>
#>  1          1 -0.130 -0.0261    34 0.118  0.343
#>  2          2  0.951  0.190     37 0.108  0.329
#>  3          3  0.471  0.0942    34 0.118  0.343
#>  4          4  0.335  0.0670    45 0.0889 0.298
#>  5          5 -1.55  -0.310     37 0.108  0.329
#>  6          6 -0.621 -0.124     48 0.0833 0.289
#>  7          7 -1.62  -0.323     47 0.0851 0.292
#>  8          8  0.853  0.171     52 0.0769 0.277
#>  9          9  1.89   0.378     30 0.133  0.365
#> 10         10 -0.867 -0.173     44 0.0909 0.302

# Larger draw with the JEBS reference cv = 0.5 and Neyman defaults.
effects50 <- gen_effects_gaussian(J = 50L, sigma_tau = 0.15)
sized <- gen_site_sizes(effects50, J = 50L, nj_mean = 50, cv = 0.5)
summary(sized$n_j)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.00   29.50   49.00   49.96   69.75  109.00 
summary(sized$se_j)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.1916  0.2395  0.2857  0.3198  0.3683  0.6325 

# JEBS bit-parity reproduction — engine A1.
a1 <- gen_site_sizes(effects, J = 10L, nj_mean = 40, cv = 0.5,
                     engine = "A1_legacy")
attr(a1, "engine")  # "A1_legacy"
#> [1] "A1_legacy"