
Overview

This vignette provides a comprehensive treatment of the Bayesian Hybrid Framework implemented in the bhfvar package. We cover:

  1. The statistical challenges motivating the framework
  2. The Bayesian Pseudo-Likelihood (BPL) approach
  3. The Hybrid Generalized Linear Mixed Model
  4. Weight scaling for multilevel models
  5. De-attenuation for finite-sample variance inflation

1. The Statistical Challenges

1.1 Variance Decomposition in Surveys

Consider a population of $N$ units partitioned into $S$ non-overlapping domains (e.g., states). For a binary outcome $Y$, the population proportion in domain $s$ is:

$$p_s = \frac{\sum_{i \in U_s} Y_i}{N_s}$$

where $U_s$ is the set of population units in domain $s$ and $N_s = |U_s|$.

The population-level variance decomposition partitions total variance into between-domain and within-domain components:

$$\text{Var}(Y) = \underbrace{\sum_{s=1}^S \pi_s (p_s - \bar{p})^2}_{\text{Between-domain}} + \underbrace{\sum_{s=1}^S \pi_s \, p_s(1-p_s)}_{\text{Within-domain}}$$

where $\pi_s = N_s/N$ are population shares and $\bar{p} = \sum_s \pi_s p_s$.

The Intraclass Correlation Coefficient (ICC) measures the proportion of variance attributable to between-domain differences:

$$\text{ICC} = \frac{\text{Var}_{\text{between}}}{\text{Var}_{\text{total}}}$$
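
As a quick check of this decomposition, here is a language-agnostic numeric sketch in plain Python (hypothetical proportions and shares, not bhfvar output). Note that the two components sum to $\bar{p}(1-\bar{p})$, the total Bernoulli variance:

```python
# Hypothetical domain proportions p_s and population shares pi_s
p = [0.10, 0.20, 0.30]
pi = [0.5, 0.3, 0.2]

p_bar = sum(w * ps for w, ps in zip(pi, p))                       # overall proportion
var_between = sum(w * (ps - p_bar) ** 2 for w, ps in zip(pi, p))  # between-domain
var_within = sum(w * ps * (1 - ps) for w, ps in zip(pi, p))       # within-domain
icc = var_between / (var_between + var_within)

# Sanity check: the components sum to the total Bernoulli variance p_bar*(1 - p_bar)
assert abs((var_between + var_within) - p_bar * (1 - p_bar)) < 1e-12
print(round(icc, 3))  # 0.043
```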

1.2 Challenge 1: Informative Sampling

Complex survey designs use stratification and clustering for cost-efficiency. When the design is informative, meaning design features correlate with the outcome, the sample distribution differs from the population:

$$E_{\text{sample}}[Y_i \mid \text{stratum}, \text{PSU}] \neq E_{\text{population}}[Y_i]$$

This creates a problem: observed between-domain variance conflates:

  • Substantive domain effects (what we want to measure)
  • Design artifacts (strata/PSU effects that happen to be distributed unevenly across domains)

1.3 Challenge 2: Finite-Sample Variance Inflation

Even without informative sampling, estimating domain proportions $\hat{p}_s$ from small samples introduces noise:

$$\hat{p}_s = p_s + e_s, \quad E[e_s] = 0, \quad \text{Var}(e_s) = V_s$$

The naive between-domain variance estimator is biased upward:

$$E[\hat{\sigma}^2_{\text{between}}] \approx \sigma^2_{\text{between}} + \sum_{s=1}^S \pi_s V_s$$

The second term represents average sampling variance—noise that masquerades as signal.
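
The inflation is easy to see in a small simulation. The sketch below (plain Python, hypothetical setup) draws samples from domains with identical true proportions, so the true between-domain variance is zero, yet the naive estimator averages close to $\sum_s \pi_s V_s$:

```python
import random
random.seed(42)

S, n_s = 20, 50                       # domains, sample size per domain
p_true = 0.3                          # identical in every domain: true between-var = 0
pi = [1 / S] * S                      # equal population shares
V_s = p_true * (1 - p_true) / n_s     # sampling variance of each p_hat_s (0.0042)

reps = 2000
naive = []
for _ in range(reps):
    p_hat = [sum(random.random() < p_true for _ in range(n_s)) / n_s for _ in range(S)]
    p_bar = sum(w * p for w, p in zip(pi, p_hat))
    naive.append(sum(w * (p - p_bar) ** 2 for w, p in zip(pi, p_hat)))

avg_naive = sum(naive) / reps
# The naive average sits near sum_s pi_s * V_s despite zero true between-variance
print(round(avg_naive, 4), round(sum(w * V_s for w in pi), 4))
```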

2. Bayesian Pseudo-Likelihood (BPL)

2.1 The Design-Based Problem

Standard likelihood-based inference treats observations as exchangeable:

$$L(\theta) = \prod_{i \in \mathcal{S}} P(Y_i \mid \theta)$$

Under informative sampling, this yields biased population estimates because over-sampled subgroups receive equal weight despite representing smaller population shares.

2.2 The BPL Solution

The Bayesian Pseudo-Likelihood approach (Savitsky & Toth, 2016) exponentiates each likelihood contribution by the survey weight:

$$L_{\text{BPL}}(Y \mid \theta) = \prod_{i \in \mathcal{S}} [P(Y_i \mid \theta)]^{w_i^*}$$

The resulting pseudo-posterior is:

$$P_{\text{BPL}}(\theta \mid Y, w) \propto L_{\text{BPL}}(Y \mid \theta) \cdot P(\theta)$$

This approach:

  • Upweights observations representing large population subgroups
  • Downweights over-sampled subgroups
  • Achieves design-consistent inference for population parameters
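
To make the mechanics concrete, here is a minimal sketch (plain Python with made-up data, not the bhfvar interface) of a weighted Bernoulli log pseudo-likelihood. Its maximizer is the weight-based (Hajek) estimate rather than the unweighted sample mean:

```python
import math

def log_pseudo_lik(theta, y, w):
    """Weighted Bernoulli log-likelihood: sum_i w_i * log P(y_i | theta)."""
    return sum(
        wi * (yi * math.log(theta) + (1 - yi) * math.log(1 - theta))
        for yi, wi in zip(y, w)
    )

y = [1, 0, 0, 1, 0]
w = [2.0, 0.5, 0.5, 1.0, 1.0]   # scaled weights summing to n = 5

# The maximizer is the weighted proportion sum(w*y)/sum(w); check by grid search.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_pseudo_lik(t, y, w))
print(theta_hat)  # 0.6, versus the unweighted mean 0.4
```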

2.3 Weight Scaling

For multilevel models, the scale of weights matters (Rabe-Hesketh & Skrondal, 2006). Raw weights summing to the population size cause:

  • Inflated “apparent” cluster sizes
  • Biased variance component estimates

The bhfvar package uses Method 2 scaling (Pfeffermann et al., 1998):

$$w_i^* = w_i \cdot \frac{n}{\sum_{i \in \mathcal{S}} w_i}$$

This scales weights to sum to the sample size $n$, preserving relative importance while stabilizing estimation.
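
A sketch of Method 2 scaling with hypothetical raw weights:

```python
# Raw weights (population-sized total) rescaled to sum to the sample size n.
w_raw = [1200.0, 800.0, 500.0, 1500.0]
n = len(w_raw)

scale = n / sum(w_raw)
w_star = [w * scale for w in w_raw]

print(sum(w_star))                                  # 4.0: scaled weights sum to n
print(w_star[0] / w_star[1], w_raw[0] / w_raw[1])   # relative weights preserved (1.5)
```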

3. The Hybrid GLMM

3.1 Model Structure

The Hybrid GLMM models the log-odds of the outcome:

$$\eta_i = \alpha + u_{\text{state}[i]} + u_{\text{psu}[i]}$$

with random effects:

$$u_{\text{state}[s]} \sim N(0, \sigma^2_{\text{state}}), \qquad u_{\text{psu}[j]} \sim N(0, \sigma^2_{\text{psu}})$$

The outcome follows:

$$Y_i \mid \eta_i \sim \text{Bernoulli}(\text{logit}^{-1}(\eta_i))$$

3.2 Why “Hybrid”?

The model is “hybrid” in two senses:

  1. Inference approach: Combines design-based weights (for representativeness) with model-based structure (for variance components)

  2. Random effects structure: Includes both substantive effects (states, the research target) and nuisance effects (PSUs, design artifacts)

This structure correctly partitions variance: $\sigma^2_{\text{state}}$ captures substantive domain heterogeneity net of design-induced clustering.

3.3 Centering Constraints

For identifiability and interpretability, we apply centering:

State effects use population-weighted centering:

$$\sum_{s=1}^S \pi_s \, u_{\text{state}[s]} = 0$$

This ensures $\alpha$ represents the population mean log-odds.

PSU effects use within-stratum centering:

$$\sum_{j \in h} u_{\text{psu}[j]} = 0 \quad \forall h$$

This prevents confounding between stratum and PSU effects.
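
Both constraints can be sketched numerically (plain Python, hypothetical effect values):

```python
# Population-weighted centering of state effects: subtract the share-weighted mean.
pi = [0.5, 0.3, 0.2]                      # population shares
u_state_raw = [0.40, -0.10, 0.25]

offset = sum(w * u for w, u in zip(pi, u_state_raw))
u_state = [u - offset for u in u_state_raw]
assert abs(sum(w * u for w, u in zip(pi, u_state))) < 1e-12

# Within-stratum centering of PSU effects: subtract each stratum's mean.
psu_by_stratum = {"h1": [0.2, -0.1, 0.3], "h2": [0.05, -0.25]}
u_psu = {
    h: [u - sum(us) / len(us) for u in us]
    for h, us in psu_by_stratum.items()
}
for h, us in u_psu.items():
    assert abs(sum(us)) < 1e-12
print("both centering constraints satisfied")
```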

3.4 Prior Specification

The bhfvar package uses weakly informative priors:

Parameter | Prior | Rationale
$\alpha$ | $N(\text{logit}(\bar{y}),\ 1.5)$ | Centered on observed proportion
$\sigma_{\text{state}}$ | Half-$N(0, 1)$ | Allows moderate heterogeneity
$\sigma_{\text{psu}}$ | Half-$N(0, 0.5)$ | Tighter prior for design effects

The non-centered parameterization improves MCMC sampling:

$$u_{\text{state}[s]} = \sigma_{\text{state}} \cdot z_{\text{state}[s]}, \quad z_{\text{state}[s]} \sim N(0, 1)$$
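
A quick numeric check of the non-centered parameterization (hypothetical $\sigma_{\text{state}}$): scaling standard-normal draws by $\sigma_{\text{state}}$ yields effects with the intended standard deviation:

```python
import random
random.seed(7)

sigma_state = 0.5
z = [random.gauss(0, 1) for _ in range(200000)]   # z ~ N(0, 1)
u = [sigma_state * zi for zi in z]                # u = sigma * z ~ N(0, sigma^2)

sd_u = (sum(x * x for x in u) / len(u)) ** 0.5
print(round(sd_u, 2))   # close to sigma_state = 0.5
```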

4. Variance Decomposition

4.1 Logit Scale (Estimand A)

On the latent logit scale, variance components decompose additively:

$$\text{Var}(\eta) = \sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \frac{\pi^2}{3}$$

The level-1 variance $\pi^2/3$ comes from the standard logistic distribution.

The ICC on the logit scale is:

$$\text{ICC}_A = \frac{\sigma^2_{\text{state}}}{\sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \pi^2/3}$$

4.2 Probability Scale (Estimand B)

To compute variance on the probability scale, we transform state effects using the Zeger marginal adjustment (Zeger et al., 1988):

$$p_s = \text{logit}^{-1}\bigl(c \cdot (\alpha + u_{\text{state}[s]})\bigr)$$

where

$$c = \sqrt{\frac{1}{1 + 0.346 \cdot \sigma^2_{\text{state}}}}$$

Between-state variance on the probability scale:

$$\text{Var}_{\text{between}} = \sum_{s=1}^S \pi_s (p_s - \bar{p})^2$$

Within-state (Bernoulli) variance:

$$\text{Var}_{\text{within}} = \sum_{s=1}^S \pi_s \, p_s(1-p_s)$$
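
Putting Estimand B together, a numeric sketch (plain Python, hypothetical parameter values rather than fitted output):

```python
import math

alpha = -1.0
sigma_state = 0.6
u_state = [0.5, 0.0, -0.5]       # hypothetical state effects
pi = [0.4, 0.35, 0.25]           # population shares

# Zeger adjustment: shrink the linear predictor before the inverse logit
c = math.sqrt(1 / (1 + 0.346 * sigma_state ** 2))
p = [1 / (1 + math.exp(-c * (alpha + u))) for u in u_state]

p_bar = sum(w * ps for w, ps in zip(pi, p))
var_between = sum(w * (ps - p_bar) ** 2 for w, ps in zip(pi, p))
var_within = sum(w * ps * (1 - ps) for w, ps in zip(pi, p))
print(round(var_between, 4), round(var_within, 4))
```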

4.3 De-attenuated (Estimand A*)

The de-attenuated estimand corrects for finite-sample variance inflation:

$$\text{Var}_{\text{between}}^{*} = \max\left(0,\ \text{Var}_{\text{between}} - \sum_{s=1}^S \pi_s \hat{V}_s\right)$$

where $\hat{V}_s$ is the design-based sampling variance for state $s$, estimated via Taylor linearization.

This correction is applied within the posterior: at each MCMC iteration, we subtract the estimated sampling variance and compute the de-attenuated ICC, properly propagating uncertainty.
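
A sketch of the per-draw correction (plain Python; posterior draws and design variances are made up for illustration):

```python
pi = [0.5, 0.3, 0.2]                      # population shares
vhat = [0.0020, 0.0035, 0.0050]           # design-based V_hat_s per state
avg_sampling_var = sum(w * v for w, v in zip(pi, vhat))

draws = [  # (var_between, var_within) from hypothetical MCMC draws
    (0.0060, 0.140),
    (0.0041, 0.138),
    (0.0025, 0.142),  # draw below the correction -> floored at zero
]
icc_star = [
    max(0.0, vb - avg_sampling_var) / (max(0.0, vb - avg_sampling_var) + vw)
    for vb, vw in draws
]
print([round(x, 4) for x in icc_star])  # last draw's de-attenuated ICC is 0
```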

5. Implementation in Stan

5.1 Key Code Blocks

Data block defines inputs:

data {
  int<lower=1> N;                           // Sample size
  int<lower=1> S;                           // Number of states
  int<lower=1> J;                           // Number of PSUs
  array[N] int<lower=0, upper=1> y;         // Binary outcome
  array[N] int<lower=1, upper=S> state_id;  // State indicator
  array[N] int<lower=1, upper=J> psu_id;    // PSU indicator
  array[N] real<lower=0> w_lik;             // Likelihood weights (scaled)
  array[S] real<lower=0, upper=1> w_state_pop_share;  // Population shares
  array[S] real<lower=0> vhat_state;        // Sampling variances
}

Model block specifies the pseudo-likelihood:

model {
  // Priors
  alpha ~ normal(prior_alpha_mean, prior_alpha_sd);
  sigma_state ~ normal(0, 1);    // half-normal via <lower=0> declaration
  sigma_psu ~ normal(0, 0.5);    // half-normal via <lower=0> declaration
  z_state ~ std_normal();
  z_psu ~ std_normal();

  // Pseudo-likelihood: each contribution exponentiated by its scaled weight
  for (i in 1:N) {
    target += w_lik[i] * bernoulli_logit_lpmf(y[i] | eta[i]);
  }
}

Generated quantities computes the three estimands at each iteration.

5.2 Computational Considerations

  • Parallel chains: Use options(mc.cores = parallel::detectCores())
  • Model caching: Stan compiles once, then reuses: rstan_options(auto_write = TRUE)
  • Adaptation: 1000 warmup iterations is typically sufficient
  • Thinning: Usually unnecessary with modern HMC

6. Shrinkage and Reliability

6.1 Bayesian Shrinkage

The multilevel structure induces partial pooling: domain estimates are pulled toward the grand mean. The shrinkage intensity depends on:

  • Sample size $n_s$: Small domains shrink more
  • Sampling variance $V_s$: Noisy estimates shrink more
  • Between-domain variance $\sigma^2_{\text{state}}$: Low variance means more shrinkage (domains are similar)

6.2 Reliability

The reliability for domain $s$ quantifies shrinkage:

$$R_s = \frac{\sigma^2_{\text{state}}}{\sigma^2_{\text{state}} + V_s}$$

Interpretation:

  • $R_s \approx 1$: Little shrinkage (large sample, low noise)
  • $R_s \approx 0$: Heavy shrinkage (small sample, high noise)

This matches the formula for reliability in classical test theory, where true score variance is separated from error variance.
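
A small numeric illustration (plain Python, hypothetical values) of how reliability falls as a domain's sampling variance grows:

```python
sigma2_state = 0.004   # hypothetical between-state variance

def reliability(V_s, sigma2=sigma2_state):
    """R_s = sigma2_state / (sigma2_state + V_s)."""
    return sigma2 / (sigma2 + V_s)

# Larger samples -> smaller V_s -> reliability near 1; tiny samples shrink hard.
for n_s, V_s in [(2000, 0.0001), (200, 0.001), (20, 0.01)]:
    print(n_s, round(reliability(V_s), 3))
```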

7. Practical Recommendations

7.1 Sample Size Guidelines

Scenario | Recommendation
Total N > 500, S > 20 | Standard settings work well
Small N or few domains | Consider tighter priors, more iterations
Very sparse domains ($n_s < 5$) | Results will be heavily shrunk; interpret cautiously

7.2 Convergence Checking

Always verify:

  • No divergent transitions (indicates geometry problems)
  • Rhat < 1.01 for all parameters (chain mixing)
  • ESS > 400 for quantities of interest (sufficient samples)

7.3 Sensitivity Analysis

Consider varying:

  • Prior distributions (especially on variance components)
  • Weight scaling methods
  • Whether to include de-attenuation

Summary

The Bayesian Hybrid Framework addresses the fundamental challenges of variance decomposition in complex surveys:

Challenge | Solution
Informative sampling | Bayesian Pseudo-Likelihood
Design artifacts | Hybrid GLMM with nuisance random effects
Finite-sample inflation | De-attenuation correction
Small domain samples | Bayesian shrinkage
Uncertainty quantification | Full posterior inference

The result is design-consistent estimates of substantive domain heterogeneity, with appropriate uncertainty quantification and protection against both design confounding and sampling noise.

References

Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23–40.

Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A, 169(4), 805–827.

Savitsky, T. D., & Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 10(1), 1677–1708.

Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44(4), 1049–1060.


For questions or feedback, please visit the GitHub repository.