
Overview

This vignette provides a comprehensive treatment of the Bayesian Hybrid Framework implemented in the bhfvar package. We cover:

  1. The statistical challenges motivating the framework
  2. The Bayesian Pseudo-Likelihood (BPL) approach
  3. The Hybrid Generalized Linear Mixed Model
  4. Weight scaling for multilevel models
  5. De-attenuation for finite-sample variance inflation

1. The Statistical Challenges

1.1 Variance Decomposition in Surveys

Consider a population of $N$ units partitioned into $S$ non-overlapping domains (e.g., states). For a binary outcome $Y$, the population proportion in domain $s$ is:

$$p_s = \frac{\sum_{i \in U_s} Y_i}{N_s}$$

where $U_s$ is the set of population units in domain $s$ and $N_s = |U_s|$.

The population-level variance decomposition partitions total variance into between-domain and within-domain components:

$$\text{Var}(Y) = \underbrace{\sum_{s=1}^S \pi_s (p_s - \bar{p})^2}_{\text{Between-domain}} + \underbrace{\sum_{s=1}^S \pi_s \, p_s(1-p_s)}_{\text{Within-domain}}$$

where $\pi_s = N_s/N$ are population shares and $\bar{p} = \sum_s \pi_s p_s$.

The Intraclass Correlation Coefficient (ICC) measures the proportion of variance attributable to between-domain differences:

$$\text{ICC} = \frac{\text{Var}_{\text{between}}}{\text{Var}_{\text{total}}}$$
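
As a quick check of this decomposition, here is a language-agnostic numeric sketch in plain Python (hypothetical proportions and shares, not bhfvar output). Note that the two components sum to $\bar{p}(1-\bar{p})$, the total Bernoulli variance:

```python
# Hypothetical domain proportions p_s and population shares pi_s
p = [0.10, 0.20, 0.30]
pi = [0.5, 0.3, 0.2]

p_bar = sum(w * ps for w, ps in zip(pi, p))                       # overall proportion
var_between = sum(w * (ps - p_bar) ** 2 for w, ps in zip(pi, p))  # between-domain
var_within = sum(w * ps * (1 - ps) for w, ps in zip(pi, p))       # within-domain
icc = var_between / (var_between + var_within)

# Sanity check: the components sum to the total Bernoulli variance p_bar*(1 - p_bar)
assert abs((var_between + var_within) - p_bar * (1 - p_bar)) < 1e-12
print(round(icc, 3))  # 0.043
```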

1.2 Challenge 1: Informative Sampling

Complex survey designs use stratification and clustering for cost-efficiency. When the design is informative, meaning design features correlate with the outcome, the sample distribution differs from the population:

$$E_{\text{sample}}[Y_i \mid \text{stratum}, \text{PSU}] \neq E_{\text{population}}[Y_i]$$

This creates a problem: observed between-domain variance conflates:

  • Substantive domain effects (what we want to measure)
  • Design artifacts (strata/PSU effects that happen to be distributed unevenly across domains)

1.3 Challenge 2: Finite-Sample Variance Inflation

Even without informative sampling, estimating domain proportions $\hat{p}_s$ from small samples introduces noise:

$$\hat{p}_s = p_s + e_s, \quad E[e_s] = 0, \quad \text{Var}(e_s) = V_s$$

The naive between-domain variance estimator is biased upward:

$$E[\hat{\sigma}^2_{\text{between}}] \approx \sigma^2_{\text{between}} + \sum_{s=1}^S \pi_s V_s$$

The second term represents average sampling variance—noise that masquerades as signal.
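
The inflation is easy to see in a small simulation. The sketch below (plain Python, hypothetical setup) draws samples from domains with identical true proportions, so the true between-domain variance is zero, yet the naive estimator averages close to $\sum_s \pi_s V_s$:

```python
import random
random.seed(42)

S, n_s = 20, 50                       # domains, sample size per domain
p_true = 0.3                          # identical in every domain: true between-var = 0
pi = [1 / S] * S                      # equal population shares
V_s = p_true * (1 - p_true) / n_s     # sampling variance of each p_hat_s (0.0042)

reps = 2000
naive = []
for _ in range(reps):
    p_hat = [sum(random.random() < p_true for _ in range(n_s)) / n_s for _ in range(S)]
    p_bar = sum(w * p for w, p in zip(pi, p_hat))
    naive.append(sum(w * (p - p_bar) ** 2 for w, p in zip(pi, p_hat)))

avg_naive = sum(naive) / reps
# The naive average sits near sum_s pi_s * V_s despite zero true between-variance
print(round(avg_naive, 4), round(sum(w * V_s for w in pi), 4))
```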

2. Bayesian Pseudo-Likelihood (BPL)

2.1 The Design-Based Problem

Standard likelihood-based inference treats observations as exchangeable:

$$L(\theta) = \prod_{i \in \mathcal{S}} P(Y_i \mid \theta)$$

Under informative sampling, this yields biased population estimates because over-sampled subgroups receive equal weight despite representing smaller population shares.

2.2 The BPL Solution

The Bayesian Pseudo-Likelihood approach (Savitsky & Toth, 2016) exponentiates each likelihood contribution by the survey weight:

$$L_{\text{BPL}}(Y \mid \theta) = \prod_{i \in \mathcal{S}} [P(Y_i \mid \theta)]^{w_i^*}$$

The resulting pseudo-posterior is:

$$P_{\text{BPL}}(\theta \mid Y, w) \propto L_{\text{BPL}}(Y \mid \theta) \cdot P(\theta)$$

This approach:

  • Upweights observations representing large population subgroups
  • Downweights over-sampled subgroups
  • Achieves design-consistent inference for population parameters
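
To make the mechanics concrete, here is a minimal sketch (plain Python with made-up data, not the bhfvar interface) of a weighted Bernoulli log pseudo-likelihood. Its maximizer is the weight-based (Hajek) estimate rather than the unweighted sample mean:

```python
import math

def log_pseudo_lik(theta, y, w):
    """Weighted Bernoulli log-likelihood: sum_i w_i * log P(y_i | theta)."""
    return sum(
        wi * (yi * math.log(theta) + (1 - yi) * math.log(1 - theta))
        for yi, wi in zip(y, w)
    )

y = [1, 0, 0, 1, 0]
w = [2.0, 0.5, 0.5, 1.0, 1.0]   # scaled weights summing to n = 5

# The maximizer is the weighted proportion sum(w*y)/sum(w); check by grid search.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_pseudo_lik(t, y, w))
print(theta_hat)  # 0.6, versus the unweighted mean 0.4
```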

2.3 Weight Scaling

For multilevel models, the scale of weights matters (Rabe-Hesketh & Skrondal, 2006). Raw weights summing to the population size cause:

  • Inflated “apparent” cluster sizes
  • Biased variance component estimates

The bhfvar package uses Method 2 scaling (Pfeffermann et al., 1998):

$$w_i^* = w_i \cdot \frac{n}{\sum_{i \in \mathcal{S}} w_i}$$

This scales weights to sum to the sample size $n$, preserving relative importance while stabilizing estimation.
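
A sketch of Method 2 scaling with hypothetical raw weights:

```python
# Raw weights (population-sized total) rescaled to sum to the sample size n.
w_raw = [1200.0, 800.0, 500.0, 1500.0]
n = len(w_raw)

scale = n / sum(w_raw)
w_star = [w * scale for w in w_raw]

print(sum(w_star))                                  # 4.0: scaled weights sum to n
print(w_star[0] / w_star[1], w_raw[0] / w_raw[1])   # relative weights preserved (1.5)
```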

3. The Hybrid GLMM

3.1 Model Structure

The Hybrid GLMM models the log-odds of the outcome:

$$\eta_i = \alpha + u_{\text{state}[i]} + u_{\text{psu}[i]}$$

with random effects:

$$u_{\text{state}[s]} \sim N(0, \sigma^2_{\text{state}}), \qquad u_{\text{psu}[j]} \sim N(0, \sigma^2_{\text{psu}})$$

The outcome follows:

$$Y_i \mid \eta_i \sim \text{Bernoulli}(\text{logit}^{-1}(\eta_i))$$

3.2 Why “Hybrid”?

The model is “hybrid” in two senses:

  1. Inference approach: Combines design-based weights (for representativeness) with model-based structure (for variance components)

  2. Random effects structure: Includes both substantive effects (states, the research target) and nuisance effects (PSUs, design artifacts)

This structure correctly partitions variance: $\sigma^2_{\text{state}}$ captures substantive domain heterogeneity net of design-induced clustering.

3.3 Centering Constraints

For identifiability and interpretability, we apply centering:

State effects use population-weighted centering:

$$\sum_{s=1}^S \pi_s \, u_{\text{state}[s]} = 0$$

This ensures $\alpha$ represents the population mean log-odds.

PSU effects use within-stratum centering:

$$\sum_{j \in h} u_{\text{psu}[j]} = 0 \quad \forall h$$

This prevents confounding between stratum and PSU effects.
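
Both constraints can be sketched numerically (plain Python, hypothetical effect values):

```python
# Population-weighted centering of state effects: subtract the share-weighted mean.
pi = [0.5, 0.3, 0.2]                      # population shares
u_state_raw = [0.40, -0.10, 0.25]

offset = sum(w * u for w, u in zip(pi, u_state_raw))
u_state = [u - offset for u in u_state_raw]
assert abs(sum(w * u for w, u in zip(pi, u_state))) < 1e-12

# Within-stratum centering of PSU effects: subtract each stratum's mean.
psu_by_stratum = {"h1": [0.2, -0.1, 0.3], "h2": [0.05, -0.25]}
u_psu = {
    h: [u - sum(us) / len(us) for u in us]
    for h, us in psu_by_stratum.items()
}
for h, us in u_psu.items():
    assert abs(sum(us)) < 1e-12
print("both centering constraints satisfied")
```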

3.4 Prior Specification

The bhfvar package uses weakly informative priors:

Parameter | Prior | Rationale
$\alpha$ | $N(\text{logit}(\bar{y}),\ 1.5)$ | Centered on observed proportion
$\sigma_{\text{state}}$ | Half-$N(0, 1)$ | Allows moderate heterogeneity
$\sigma_{\text{psu}}$ | Half-$N(0, 0.5)$ | Tighter prior for design effects

The non-centered parameterization improves MCMC sampling:

$$u_{\text{state}[s]} = \sigma_{\text{state}} \cdot z_{\text{state}[s]}, \quad z_{\text{state}[s]} \sim N(0, 1)$$
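
A quick numeric check of the non-centered parameterization (hypothetical $\sigma_{\text{state}}$): scaling standard-normal draws by $\sigma_{\text{state}}$ yields effects with the intended standard deviation:

```python
import random
random.seed(7)

sigma_state = 0.5
z = [random.gauss(0, 1) for _ in range(200000)]   # z ~ N(0, 1)
u = [sigma_state * zi for zi in z]                # u = sigma * z ~ N(0, sigma^2)

sd_u = (sum(x * x for x in u) / len(u)) ** 0.5
print(round(sd_u, 2))   # close to sigma_state = 0.5
```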

4. Variance Decomposition

4.1 Logit Scale (Estimand A)

On the latent logit scale, variance components decompose additively:

$$\text{Var}(\eta) = \sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \frac{\pi^2}{3}$$

The level-1 variance $\pi^2/3$ comes from the standard logistic distribution.

The ICC on the logit scale is:

$$\text{ICC}_A = \frac{\sigma^2_{\text{state}}}{\sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \pi^2/3}$$

4.2 Probability Scale (Estimand B)

To compute variance on the probability scale, we transform state effects using the Zeger marginal adjustment (Zeger et al., 1988):

$$p_s = \text{logit}^{-1}\bigl(c \cdot (\alpha + u_{\text{state}[s]})\bigr)$$

where

$$c = \sqrt{\frac{1}{1 + 0.346 \cdot \sigma^2_{\text{state}}}}$$

Between-state variance on the probability scale:

$$\text{Var}_{\text{between}} = \sum_{s=1}^S \pi_s (p_s - \bar{p})^2$$

Within-state (Bernoulli) variance:

$$\text{Var}_{\text{within}} = \sum_{s=1}^S \pi_s \, p_s(1-p_s)$$
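
Putting Estimand B together, a numeric sketch (plain Python, hypothetical parameter values rather than fitted output):

```python
import math

alpha = -1.0
sigma_state = 0.6
u_state = [0.5, 0.0, -0.5]       # hypothetical state effects
pi = [0.4, 0.35, 0.25]           # population shares

# Zeger adjustment: shrink the linear predictor before the inverse logit
c = math.sqrt(1 / (1 + 0.346 * sigma_state ** 2))
p = [1 / (1 + math.exp(-c * (alpha + u))) for u in u_state]

p_bar = sum(w * ps for w, ps in zip(pi, p))
var_between = sum(w * (ps - p_bar) ** 2 for w, ps in zip(pi, p))
var_within = sum(w * ps * (1 - ps) for w, ps in zip(pi, p))
print(round(var_between, 4), round(var_within, 4))
```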

4.3 De-attenuated (Estimand A*)

The de-attenuated estimand corrects for finite-sample variance inflation:

$$\text{Var}_{\text{between}}^{*} = \max\left(0,\ \text{Var}_{\text{between}} - \sum_{s=1}^S \pi_s \hat{V}_s\right)$$

where $\hat{V}_s$ is the design-based sampling variance for state $s$, estimated via Taylor linearization.

This correction is applied within the posterior: at each MCMC iteration, we subtract the estimated sampling variance and compute the de-attenuated ICC, properly propagating uncertainty.
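
A sketch of the per-draw correction (plain Python; posterior draws and design variances are made up for illustration):

```python
pi = [0.5, 0.3, 0.2]                      # population shares
vhat = [0.0020, 0.0035, 0.0050]           # design-based V_hat_s per state
avg_sampling_var = sum(w * v for w, v in zip(pi, vhat))

draws = [  # (var_between, var_within) from hypothetical MCMC draws
    (0.0060, 0.140),
    (0.0041, 0.138),
    (0.0025, 0.142),  # draw below the correction -> floored at zero
]
icc_star = [
    max(0.0, vb - avg_sampling_var) / (max(0.0, vb - avg_sampling_var) + vw)
    for vb, vw in draws
]
print([round(x, 4) for x in icc_star])  # last draw's de-attenuated ICC is 0
```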

5. Implementation in Stan

5.1 Key Code Blocks

Data block defines inputs:

data {
  int<lower=1> N;                           // Sample size
  int<lower=1> S;                           // Number of states
  int<lower=1> J;                           // Number of PSUs
  array[N] int<lower=0, upper=1> y;         // Binary outcome
  array[N] int<lower=1, upper=S> state_id;  // State indicator
  array[N] int<lower=1, upper=J> psu_id;    // PSU indicator
  array[N] real<lower=0> w_lik;             // Likelihood weights (scaled)
  array[S] real<lower=0, upper=1> w_state_pop_share;  // Population shares
  array[S] real<lower=0> vhat_state;        // Sampling variances
}

Model block specifies the pseudo-likelihood:

model {
  // Priors
  alpha ~ normal(prior_alpha_mean, prior_alpha_sd);
  sigma_state ~ normal(0, 1);    // half-normal via <lower=0> declaration
  sigma_psu ~ normal(0, 0.5);    // half-normal via <lower=0> declaration
  z_state ~ std_normal();
  z_psu ~ std_normal();

  // Pseudo-likelihood: each contribution exponentiated by its scaled weight
  for (i in 1:N) {
    target += w_lik[i] * bernoulli_logit_lpmf(y[i] | eta[i]);
  }
}

Generated quantities computes the three estimands at each iteration.

5.2 Computational Considerations

  • Parallel chains: Use options(mc.cores = parallel::detectCores())
  • Model caching: Stan compiles once, then reuses: rstan_options(auto_write = TRUE)
  • Adaptation: 1000 warmup iterations is typically sufficient
  • Thinning: Usually unnecessary with modern HMC

6. Shrinkage and Reliability

6.1 Bayesian Shrinkage

The multilevel structure induces partial pooling: domain estimates are pulled toward the grand mean. The shrinkage intensity depends on:

  • Sample size $n_s$: Small domains shrink more
  • Sampling variance $V_s$: Noisy estimates shrink more
  • Between-domain variance $\sigma^2_{\text{state}}$: Low variance means more shrinkage (domains are similar)

6.2 Reliability

The reliability for domain $s$ quantifies shrinkage:

$$R_s = \frac{\sigma^2_{\text{state}}}{\sigma^2_{\text{state}} + V_s}$$

Interpretation:

  • $R_s \approx 1$: Little shrinkage (large sample, low noise)
  • $R_s \approx 0$: Heavy shrinkage (small sample, high noise)

This matches the formula for reliability in classical test theory, where true score variance is separated from error variance.
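
A small numeric illustration (plain Python, hypothetical values) of how reliability falls as a domain's sampling variance grows:

```python
sigma2_state = 0.004   # hypothetical between-state variance

def reliability(V_s, sigma2=sigma2_state):
    """R_s = sigma2_state / (sigma2_state + V_s)."""
    return sigma2 / (sigma2 + V_s)

# Larger samples -> smaller V_s -> reliability near 1; tiny samples shrink hard.
for n_s, V_s in [(2000, 0.0001), (200, 0.001), (20, 0.01)]:
    print(n_s, round(reliability(V_s), 3))
```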

7. Practical Recommendations

7.1 Sample Size Guidelines

Scenario | Recommendation
Total N > 500, S > 20 | Standard settings work well
Small N or few domains | Consider tighter priors, more iterations
Very sparse domains ($n_s < 5$) | Results will be heavily shrunk; interpret cautiously

7.2 Convergence Checking

Always verify:

  • No divergent transitions (indicates geometry problems)
  • Rhat < 1.01 for all parameters (chain mixing)
  • ESS > 400 for quantities of interest (sufficient samples)

7.3 Sensitivity Analysis

Consider varying:

  • Prior distributions (especially on variance components)
  • Weight scaling methods
  • Whether to include de-attenuation

Summary

The Bayesian Hybrid Framework addresses the fundamental challenges of variance decomposition in complex surveys:

Challenge | Solution
Informative sampling | Bayesian Pseudo-Likelihood
Design artifacts | Hybrid GLMM with nuisance random effects
Finite-sample inflation | De-attenuation correction
Small domain samples | Bayesian shrinkage
Uncertainty quantification | Full posterior inference

The result is design-consistent estimates of substantive domain heterogeneity, with appropriate uncertainty quantification and protection against both design confounding and sampling noise.

References

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23–40.

Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A, 169(4), 805–827.

Savitsky, T. D., & Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 10(1), 1677–1708.

Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44(4), 1049–1060.


For questions or feedback, please visit the GitHub repository.