Generate latent site effects

gen_effects() is the Layer 1 entry point of the multisiteDGP pipeline. It draws standardized site effects $z_j$ from one of the eight $G$ shapes selectable through true_dist (six parametric families, a user callback, and a DPM bridge) and returns a forward-compatible tibble that Layers 2 through 4 can consume. Most users invoke it indirectly through sim_multisite or sim_meta; call it directly when composing the four layers manually or auditing a single layer in isolation. Shape selection is controlled by true_dist, shape-specific parameters travel through theta_G, and a user callback g_fn replaces the parametric catalog when true_dist = "User".

Usage

gen_effects(
  J,
  true_dist = c("Gaussian", "StudentT", "SkewN", "ALD", "Mixture", "PointMassSlab",
    "User", "DPM"),
  tau = 0,
  sigma_tau = 0.2,
  variance = 1,
  theta_G = list(),
  formula = NULL,
  beta = NULL,
  data = NULL,
  g_fn = NULL,
  g_returns = c("standardized", "raw"),
  audit_g = TRUE,
  upstream = NULL
)

Arguments

J: Integer. Number of sites.
true_dist: Character. One of "Gaussian", "StudentT", "SkewN", "ALD", "Mixture", "PointMassSlab", "User", or "DPM". Default "Gaussian". If g_fn is supplied without true_dist, the package auto-selects "User".
tau: Numeric. Grand mean on the response scale. Default 0.
sigma_tau: Numeric ($\ge 0$). Between-site standard deviation on the response scale (a standard deviation, not a variance). Default 0.2.
variance: Numeric. Legacy variance argument for the "Gaussian" shape. Default 1, which is the only value consistent with the unit-variance convention; the other shapes ignore this argument.
theta_G: Named list of shape-specific parameters. Keys vary by true_dist; see the eight-shape catalog under Details.
formula: One-sided formula for site-level covariates (e.g., ~ x1 + x2), or NULL.
beta: Numeric coefficient vector matching the columns of the model matrix built from formula, or NULL.
data: A data.frame containing the predictors named in formula, or NULL.
g_fn: Optional user callback for true_dist = "User" (or for the "DPM" bridge). Receives J and returns a length-J numeric vector.
g_returns: Character. "standardized" (Convention A, default) — the callback returns standardized residuals $z_j$ and the package rescales. "raw" (Convention B) — the callback returns response-scale effects and the package does not rescale.
audit_g: Logical. When g_returns = "standardized", validate that the callback draws meet the unit-moment contract. Default TRUE. Has no effect under g_returns = "raw".
upstream: Reserved for future layer composition. Leave NULL (the default); passing a non-NULL value aborts.

Value

A tibble with one row per site:

site_index: Integer 1..J — preserved through downstream layers.
z_j: Standardized residual effect — mean 0, variance 1 by construction.
tau_j: Response-scale latent effect, $\tau + X_j\boldsymbol{\beta} + \sigma_\tau\,z_j$.
<covariate columns>: Pass-through from data if formula was non-NULL.
latent_component: Character; for true_dist = "Mixture", names which mixture component each row was drawn from. Absent for the other seven shapes.

The tibble carries no S3 class beyond tbl_df — Layer 2 functions add the package's classes on top.

Details

The eight values of true_dist, each with its corresponding generator function in parentheses, are:

"Gaussian" (gen_effects_gaussian): Standard normal — the canonical baseline. No theta_G keys.
"StudentT" (gen_effects_studentt): Standardized Student-$t$ with degrees of freedom theta_G$nu (numeric, > 2). Heavier tails than Gaussian.
"SkewN" (gen_effects_skewn): Standardized skew-normal with slant theta_G$slant (numeric). Asymmetric shape.
"ALD" (gen_effects_ald): Standardized asymmetric Laplace with asymmetry theta_G$rho $\in (0, 1)$.
"Mixture" (gen_effects_mixture): Two-component normal mixture with theta_G$delta (component separation), theta_G$eps (mixing weight), theta_G$ups (variance ratio). Use for bimodal or contaminated effects.
"PointMassSlab" (gen_effects_pmslab): Point mass at 0 with probability theta_G$pi0, plus a continuous slab governed by theta_G$slab_shape, theta_G$mu_slab, theta_G$sigma_slab. Use when a fraction of sites have null effects.
"User" (gen_effects_user): Any user callback g_fn returning length-J standardized residuals (or raw response-scale effects under g_returns = "raw").
"DPM" (gen_effects_dpm): Dirichlet-process mixture — currently available only via the g_fn callback bridge. Direct DPM is unimplemented in the current release.

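As an illustration of what "standardized" means for a non-Gaussian shape, the following base R sketch standardizes a two-component normal mixture analytically. It does not call multisiteDGP, and the parameterization is an assumption made for the sketch: component 1 is $N(0, 1)$ with weight $1 - \varepsilon$, component 2 is $N(\delta, \upsilon)$ with weight $\varepsilon$, and ups is read as the ratio of component variances. The package's internal parameterization may differ.

```r
set.seed(123)
J <- 1e5L
delta <- 1.0  # component separation (assumed: mean of component 2)
eps   <- 0.2  # mixing weight (assumed: probability of component 2)
ups   <- 2.0  # variance ratio (assumed: Var(comp 2) / Var(comp 1))

# Draw component labels, then the raw (unstandardized) mixture.
comp <- ifelse(runif(J) < eps, 2L, 1L)
raw  <- ifelse(comp == 1L, rnorm(J, 0, 1), rnorm(J, delta, sqrt(ups)))

# Analytic mean and variance of the raw mixture under the assumed form.
m <- eps * delta
v <- (1 - eps) + eps * ups + eps * (1 - eps) * delta^2

# Standardized draws: mean 0 and variance 1 in expectation.
z <- (raw - m) / sqrt(v)
c(mean = mean(z), var = var(z), share_comp2 = mean(comp == 2L))
```

The same pattern (subtract the analytic mean, divide by the analytic standard deviation) applies to any shape whose first two moments are known in closed form.
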
When to call this directly. For most users, sim_multisite or sim_meta is the right entry point; direct calls to gen_effects() are an advanced surface. Three situations warrant a direct call: composing the four layers manually to inspect or modify the Layer 1 → Layer 2 contract; tracing a suspicious downstream diagnostic back to its source by verifying Layer 1 in isolation; and testing a g_fn callback's output before plugging it into the full simulation.
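
For the third situation, a callback can be screened in base R before it is handed to gen_effects(). The helper below, check_unit_moments(), is hypothetical and not part of multisiteDGP; its tolerance and sample size are arbitrary choices for the sketch, and it is not a description of what audit_g does internally.

```r
# Hypothetical helper (not a multisiteDGP function): draw once from a
# callback and compare the sample moments with the unit-moment target.
check_unit_moments <- function(g_fn, J = 1e5L, tol = 0.05) {
  z <- g_fn(J)
  stopifnot(is.numeric(z), length(z) == J, all(is.finite(z)))
  m <- mean(z)
  v <- stats::var(z)
  list(mean = m, var = v, ok = abs(m) < tol && abs(v - 1) < tol)
}

set.seed(1)

# Uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1.
unif_g <- function(J) runif(J, -sqrt(3), sqrt(3))

# An unscaled t with 5 df has variance 5/3, so it should fail the check.
raw_t_g <- function(J) rt(J, df = 5)

res_unif <- check_unit_moments(unif_g)
res_t    <- check_unit_moments(raw_t_g)
c(uniform_ok = res_unif$ok, raw_t_ok = res_t$ok)
```

A callback that fails such a check either needs rescaling inside the callback or belongs under g_returns = "raw".
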

Unit-variance convention. All eight shapes share a unit-variance standardization: the package draws $z_j$ with $E[z_j] = 0$ and $\mathrm{Var}(z_j) = 1$, then rescales to $\tau_j = \tau + X_j\boldsymbol{\beta} + \sigma_\tau\,z_j$. This makes sigma_tau a single comparable knob across shapes — heterogeneity targets mean the same thing whether true_dist = "Gaussian" or true_dist = "ALD".
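
The effect of the convention can be seen in a small base R sketch that does not call the package. It assumes no covariates and uses the known variance of the Student-$t$ distribution, $\nu / (\nu - 2)$, to standardize; the values of tau, sigma_tau, and nu are illustrative only.

```r
set.seed(42)
J         <- 1e5L
tau       <- 0.1
sigma_tau <- 0.2
nu        <- 5

# Standardize each shape so that E[z] = 0 and Var(z) = 1.
z_gauss <- rnorm(J)
z_t     <- rt(J, df = nu) / sqrt(nu / (nu - 2))  # Var(t_nu) = nu / (nu - 2)

# Identical rescaling for both shapes (no covariate term here).
tau_gauss <- tau + sigma_tau * z_gauss
tau_t     <- tau + sigma_tau * z_t

# Both standard deviations should be close to sigma_tau.
c(sd_gauss = sd(tau_gauss), sd_t = sd(tau_t))
```

Because both shapes are standardized before rescaling, the between-site standard deviation of tau_j is governed by sigma_tau alone; only the tail behavior differs.
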

Convention A vs Convention B (user callbacks). Under g_returns = "standardized" (Convention A, the default) the callback returns standardized residuals $z_j$; the package rescales by sigma_tau and adds tau (and any covariate adjustment) to form tau_j. Under g_returns = "raw" (Convention B) the callback returns the response-scale effect directly; the package leaves it untouched. Convention A integrates with downstream diagnostics (notably informativeness and heterogeneity_ratio) without further work; Convention B is for callbacks where standardization is meaningless or undesirable. See gen_effects_user.
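
The difference between the two conventions reduces to whether the affine rescaling is applied. The sketch below mirrors that logic in base R without covariates; finish_effects() is a hypothetical name introduced for illustration and is not a multisiteDGP function.

```r
# Hypothetical helper (not a multisiteDGP function) that mirrors the
# two conventions described above, ignoring covariates.
finish_effects <- function(draws,
                           g_returns = c("standardized", "raw"),
                           tau = 0, sigma_tau = 0.2) {
  g_returns <- match.arg(g_returns)
  if (g_returns == "standardized") {
    tau + sigma_tau * draws  # Convention A: rescale, then shift
  } else {
    draws                    # Convention B: pass through untouched
  }
}

draws  <- c(-1, 0, 1)
conv_a <- finish_effects(draws, "standardized", tau = 0.5, sigma_tau = 0.2)
conv_b <- finish_effects(draws, "raw", tau = 0.5, sigma_tau = 0.2)

conv_a  # 0.3 0.5 0.7
conv_b  # -1 0 1
```

Under Convention B, tau and sigma_tau play no role in this sketch, which is why a raw callback must already produce effects on the response scale.
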

Covariate adjustment. When formula is non-NULL, a model matrix $X$ is built from data and combined with beta to form $X_j\boldsymbol{\beta}$, which enters the linear predictor for $\tau_j$ additively. The covariate columns from data pass through to the returned tibble so downstream layers can recover them.
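
A minimal base R sketch of this step follows. It rests on one assumption not stated in this page: that the intercept column of the model matrix is dropped, so that tau plays the role of the intercept and beta has one entry per remaining column (consistent with a length-one beta for ~ x). The package may build the matrix differently.

```r
set.seed(7)
J         <- 20L
sites     <- data.frame(x = rnorm(J))
tau       <- 0
sigma_tau <- 0.15
beta      <- 0.3

# Build the model matrix from the one-sided formula.
X <- stats::model.matrix(~ x, data = sites)

# Assumption for this sketch: drop the intercept, since tau already
# serves as the grand mean.
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
stopifnot(ncol(X) == length(beta))

# Additive linear predictor: tau + X beta + sigma_tau * z.
z     <- rnorm(J)
tau_j <- tau + as.numeric(X %*% beta) + sigma_tau * z

# Covariate columns pass through alongside the effects.
out <- data.frame(site_index = seq_len(J), z_j = z, tau_j = tau_j, sites)
head(out, 3)
```

With a factor covariate or an interaction, the model matrix gains columns and beta must grow to match, which is why the length check precedes the matrix product.
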

For per-shape derivations and decision rubrics, see the G-distribution catalog and standardization vignette. For the g_fn callback contract, see the Custom G distributions vignette. For the formula / beta / data covariate surface, see the Covariates and precision dependence vignette.

References

Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.

Examples

# Gaussian (default — the canonical baseline).
gauss <- gen_effects(J = 10L, true_dist = "Gaussian", sigma_tau = 0.2)
head(gauss)
#> # A tibble: 6 × 3
#>   site_index     z_j    tau_j
#>        <int>   <dbl>    <dbl>
#> 1          1  0.233   0.0466 
#> 2          2  0.0311  0.00621
#> 3          3  0.358   0.0716 
#> 4          4  1.61    0.322  
#> 5          5  1.43    0.286  
#> 6          6 -0.948  -0.190  

# Student-t with df = 5 — heavier tails for a robustness check.
studentt <- gen_effects(J = 50L, true_dist = "StudentT",
                        sigma_tau = 0.2, theta_G = list(nu = 5))

# Mixture: two-component bimodal effects.
mix <- gen_effects(J = 50L, true_dist = "Mixture",
                   sigma_tau = 0.2,
                   theta_G = list(delta = 1.0, eps = 0.2, ups = 2.0))
table(mix$latent_component)
#> 
#>  1  2 
#> 42  8 

# Covariate-adjusted: tau_j = tau + 0.3 * x_j + sigma_tau * z_j.
sites <- data.frame(x = rnorm(20))
# `data` has one row per site; the result is named cov_fx to avoid
# masking stats::cov().
cov_fx <- gen_effects(J = 20L, true_dist = "Gaussian",
                      formula = ~ x, beta = 0.3, data = sites,
                      sigma_tau = 0.15)

# User callback (Convention A — standardized residuals).
my_g <- function(J) rnorm(J)
user <- gen_effects(J = 50L, g_fn = my_g)  # auto-selects true_dist = "User"