Methodology: The Bayesian Hybrid Framework
JoonHo Lee
2026-01-24
Source: vignettes/methodology.Rmd

Overview
This vignette provides a comprehensive treatment of the Bayesian Hybrid Framework implemented in the bhfvar package. We cover:
- The statistical challenges motivating the framework
- The Bayesian Pseudo-Likelihood (BPL) approach
- The Hybrid Generalized Linear Mixed Model
- Weight scaling for multilevel models
- De-attenuation for finite-sample variance inflation
1. The Statistical Challenges
1.1 Variance Decomposition in Surveys
Consider a population of $N$ units partitioned into $S$ non-overlapping domains (e.g., states). For a binary outcome $y_i \in \{0, 1\}$, the population proportion in domain $s$ is

$$P_s = \frac{1}{N_s} \sum_{i \in U_s} y_i,$$

where $U_s$ is the set of population units in domain $s$ and $N_s = |U_s|$.

The population-level variance decomposition partitions total variance into between-domain and within-domain components:

$$\sigma^2_{\text{total}} = \underbrace{\sum_s W_s (P_s - \bar{P})^2}_{\text{between}} + \underbrace{\sum_s W_s P_s (1 - P_s)}_{\text{within}},$$

where $W_s = N_s / N$ are population shares and $\bar{P} = \sum_s W_s P_s$.

The Intraclass Correlation Coefficient (ICC) measures the proportion of variance attributable to between-domain differences:

$$\text{ICC} = \frac{\sigma^2_{\text{between}}}{\sigma^2_{\text{between}} + \sigma^2_{\text{within}}}.$$
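This decomposition is easy to verify numerically. The sketch below (plain Python with toy numbers, not the bhfvar API) computes the between-domain and within-domain components and the ICC for a three-domain population; the two components sum exactly to the total Bernoulli variance $\bar{P}(1 - \bar{P})$.

```python
# Illustrative sketch: population variance decomposition for a binary
# outcome across 3 domains (toy numbers, not bhfvar output).
P = [0.2, 0.5, 0.3]          # domain proportions P_s
W = [0.5, 0.3, 0.2]          # population shares W_s (sum to 1)

p_bar = sum(w * p for w, p in zip(W, P))                   # grand mean
between = sum(w * (p - p_bar) ** 2 for w, p in zip(W, P))  # between-domain
within = sum(w * p * (1 - p) for w, p in zip(W, P))        # within-domain
icc = between / (between + within)

# The two components recover the total variance p_bar * (1 - p_bar).
print(round(between, 4), round(within, 4), round(icc, 3))
```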
1.2 Challenge 1: Informative Sampling
Complex survey designs use stratification and clustering for cost-efficiency. When the design is informative, meaning design features correlate with the outcome, the sample distribution differs from the population:

$$p(y_i \mid i \in \text{sample}) \neq p(y_i).$$
This creates a problem: observed between-domain variance conflates:
- Substantive domain effects (what we want to measure)
- Design artifacts (strata/PSU effects that happen to be distributed unevenly across domains)
1.3 Challenge 2: Finite-Sample Variance Inflation
Even without informative sampling, estimating domain proportions from small samples introduces noise:

$$\hat{p}_s = P_s + e_s, \qquad \mathbb{E}[e_s] = 0, \quad \operatorname{Var}(e_s) = v_s.$$

The naive between-domain variance estimator is biased upward:

$$\mathbb{E}\left[ \sum_s W_s (\hat{p}_s - \bar{p})^2 \right] \approx \sigma^2_{\text{between}} + \sum_s W_s v_s.$$

The second term is the average sampling variance: noise that masquerades as signal.
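A small simulation makes the inflation concrete. The sketch below (toy setup, illustrative only) draws small samples of size 10 from 20 domains; the naive weighted between-domain variance of the sample proportions lands well above the true between-domain variance, by roughly the average sampling variance.

```python
# Illustrative sketch: naive between-domain variance from small samples
# overstates the true between-domain variance by roughly the average
# sampling variance sum_s W_s * v_s.
import random

random.seed(1)
S, n_s = 20, 10                            # 20 domains, 10 draws each
P = [0.3 + 0.02 * s for s in range(S)]     # true domain proportions
W = [1.0 / S] * S                          # equal population shares

P_bar = sum(w * p for w, p in zip(W, P))
true_between = sum(w * (p - P_bar) ** 2 for w, p in zip(W, P))

reps, naive = 2000, 0.0
for _ in range(reps):
    p_hat = [sum(random.random() < p for _ in range(n_s)) / n_s for p in P]
    p_bar = sum(w * ph for w, ph in zip(W, p_hat))
    naive += sum(w * (ph - p_bar) ** 2 for w, ph in zip(W, p_hat)) / reps

avg_sampling_var = sum(w * p * (1 - p) / n_s for w, p in zip(W, P))
print(true_between, naive, true_between + avg_sampling_var)
```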
2. Bayesian Pseudo-Likelihood (BPL)
2.1 The Design-Based Problem
Standard likelihood-based inference treats observations as exchangeable:

$$L(\theta) = \prod_{i=1}^{n} p(y_i \mid \theta).$$
Under informative sampling, this yields biased population estimates because over-sampled subgroups receive equal weight despite representing smaller population shares.
2.2 The BPL Solution
The Bayesian Pseudo-Likelihood approach (Savitsky & Toth, 2016) exponentiates each likelihood contribution by the survey weight:

$$L^{w}(\theta) = \prod_{i=1}^{n} p(y_i \mid \theta)^{w_i}.$$

The resulting pseudo-posterior is:

$$\pi^{w}(\theta \mid y) \propto \left[ \prod_{i=1}^{n} p(y_i \mid \theta)^{w_i} \right] \pi(\theta).$$
This approach:
- Upweights observations representing large population subgroups
- Downweights over-sampled subgroups
- Achieves design-consistent inference for population parameters
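On the log scale, exponentiating by $w_i$ is just multiplying each unit's log-likelihood contribution by its weight. A minimal sketch (plain Python, not the bhfvar implementation) for a Bernoulli outcome shows how the weighted maximum moves toward the under-sampled subgroup:

```python
# Minimal sketch of a Bernoulli pseudo-log-likelihood: each unit's
# log-likelihood contribution is multiplied by its survey weight.
import math

def pseudo_loglik(theta, y, w):
    """Weighted Bernoulli log-likelihood at success probability theta."""
    return sum(wi * (yi * math.log(theta) + (1 - yi) * math.log(1 - theta))
               for yi, wi in zip(y, w))

# The over-sampled subgroup (y = 1) gets small weights; the single zero
# represents a large population share, so it pulls the estimate down.
y = [1, 1, 1, 0]
w = [0.5, 0.5, 0.5, 2.5]

grid = [i / 100 for i in range(1, 100)]
mle = max(grid, key=lambda t: pseudo_loglik(t, y, w))
# The weighted maximizer is close to sum(w*y)/sum(w) = 0.375,
# far below the unweighted sample mean of 0.75.
print(mle)
```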
2.3 Weight Scaling
For multilevel models, the scale of weights matters (Rabe-Hesketh & Skrondal, 2006). Raw weights summing to the population size cause:
- Inflated “apparent” cluster sizes
- Biased variance component estimates
The bhfvar package uses Method 2 scaling (Pfeffermann et al., 1998):

$$\tilde{w}_i = w_i \cdot \frac{n}{\sum_{j=1}^{n} w_j}.$$

This scales the weights to sum to the sample size $n$, preserving relative importance while stabilizing estimation.
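The rescaling itself is a one-liner. The sketch below shows the sum-to-$n$ normalization described above (an illustration of the scaling rule, not the package's internal code, whose implementation details may differ):

```python
# Sketch of sample-size weight scaling: rescale raw weights so they sum
# to n while keeping their relative sizes unchanged.
def scale_weights(w):
    n = len(w)
    total = sum(w)
    return [wi * n / total for wi in w]

raw = [1000.0, 2000.0, 3000.0, 4000.0]   # raw weights sum to 10,000
scaled = scale_weights(raw)
print(scaled)   # [0.4, 0.8, 1.2, 1.6], which sums to n = 4
```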
3. The Hybrid GLMM
3.1 Model Structure
The Hybrid GLMM models the log-odds of the outcome:

$$\eta_i = \operatorname{logit} \Pr(y_i = 1) = \alpha + u_{s(i)} + v_{j(i)},$$

with random effects:

$$u_s \sim N(0, \sigma^2_{\text{state}}), \qquad v_j \sim N(0, \sigma^2_{\text{psu}}).$$

The outcome follows:

$$y_i \sim \text{Bernoulli}\big(\operatorname{logit}^{-1}(\eta_i)\big).$$
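To see the data-generating process concretely, here is a toy forward simulation of that structure in plain Python (illustrative parameter values; not the Stan model or the bhfvar API):

```python
# Illustrative simulation from the hybrid structure:
# eta_i = alpha + u_state(i) + v_psu(i), y_i ~ Bernoulli(inv_logit(eta_i)).
import math
import random

random.seed(7)
alpha, sigma_state, sigma_psu = -0.5, 0.6, 0.3   # toy parameter values
S, J, N = 10, 30, 3000

u = [random.gauss(0, sigma_state) for _ in range(S)]   # state effects
v = [random.gauss(0, sigma_psu) for _ in range(J)]     # PSU effects

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

y = []
for i in range(N):
    s, j = random.randrange(S), random.randrange(J)
    y.append(int(random.random() < inv_logit(alpha + u[s] + v[j])))

# Overall proportion; its exact value depends on the simulated effects.
print(sum(y) / N)
```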
3.2 Why “Hybrid”?
The model is “hybrid” in two senses:
Inference approach: Combines design-based weights (for representativeness) with model-based structure (for variance components)
Random effects structure: Includes both substantive effects (states, the research target) and nuisance effects (PSUs, design artifacts)
This structure correctly partitions variance: $\sigma^2_{\text{state}}$ captures substantive domain heterogeneity net of design-induced clustering.
4. Variance Decomposition
4.1 Logit Scale (Estimand A)
On the latent logit scale, variance components decompose additively:

$$\sigma^2_{\text{total}} = \sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \frac{\pi^2}{3}.$$

The level-1 variance $\pi^2 / 3 \approx 3.29$ comes from the logistic distribution.

The ICC on the logit scale is:

$$\text{ICC}_{\text{logit}} = \frac{\sigma^2_{\text{state}}}{\sigma^2_{\text{state}} + \sigma^2_{\text{psu}} + \pi^2 / 3}.$$
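With toy variance components, the logit-scale ICC is a direct plug-in calculation (illustrative values, not bhfvar output):

```python
# Sketch: logit-scale variance decomposition. The level-1 variance of the
# latent logistic error is pi^2 / 3, about 3.29.
import math

sigma_state, sigma_psu = 0.6, 0.3   # toy standard deviations
level1 = math.pi ** 2 / 3

total = sigma_state ** 2 + sigma_psu ** 2 + level1
icc_logit = sigma_state ** 2 / total
print(round(icc_logit, 3))   # 0.096
```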
4.2 Probability Scale (Estimand B)
To compute variance on the probability scale, we transform state effects using the Zeger marginal adjustment (Zeger et al., 1988):

$$p_s = \operatorname{logit}^{-1}\!\left( \frac{\alpha + u_s}{\sqrt{1 + c^2 \sigma^2_{\text{psu}}}} \right), \qquad \text{where } c = \frac{16\sqrt{3}}{15\pi}.$$

Between-state variance on the probability scale:

$$\sigma^2_{\text{between}} = \sum_s W_s (p_s - \bar{p})^2, \qquad \bar{p} = \sum_s W_s p_s.$$

Within-state (Bernoulli) variance:

$$\sigma^2_{\text{within}} = \sum_s W_s\, p_s (1 - p_s).$$
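The adjustment and both variance components can be sketched in a few lines. The toy values below are illustrative; the attenuation constant $c = 16\sqrt{3}/(15\pi)$ is the standard Zeger et al. (1988) approximation, but consult the package source for the exact formula bhfvar applies.

```python
# Sketch of the Zeger marginal adjustment plus probability-scale
# variance components (toy state effects and shares).
import math

c = 16 * math.sqrt(3) / (15 * math.pi)   # standard attenuation constant

def marginal_prob(alpha, u_s, sigma_psu):
    """State-level probability, marginalizing over the PSU effect."""
    eta = (alpha + u_s) / math.sqrt(1 + c ** 2 * sigma_psu ** 2)
    return 1 / (1 + math.exp(-eta))

alpha, sigma_psu = -0.5, 0.3
u = [-0.4, 0.0, 0.5]    # toy state effects
W = [0.3, 0.4, 0.3]     # population shares

p = [marginal_prob(alpha, us, sigma_psu) for us in u]
p_bar = sum(w * ps for w, ps in zip(W, p))
between_p = sum(w * (ps - p_bar) ** 2 for w, ps in zip(W, p))
within_p = sum(w * ps * (1 - ps) for w, ps in zip(W, p))
print(between_p, within_p)
```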
4.3 De-attenuated (Estimand A*)
The de-attenuated estimand corrects for finite-sample variance inflation:

$$\sigma^{2*}_{\text{between}} = \max\!\left( \sigma^2_{\text{between}} - \sum_s W_s \hat{v}_s,\ 0 \right),$$

where $\hat{v}_s$ is the design-based sampling variance for state $s$, estimated via Taylor linearization.
This correction is applied within the posterior: at each MCMC iteration, we subtract the estimated sampling variance and compute the de-attenuated ICC, properly propagating uncertainty.
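Per-draw application is what propagates the uncertainty. A minimal sketch of the mechanics, with invented numbers standing in for posterior draws and design-based variance estimates:

```python
# Sketch: de-attenuation applied per posterior draw. Toy numbers; in
# practice vhat_s comes from Taylor-linearized design-based estimates
# and draws_between from MCMC.
draws_between = [0.020, 0.025, 0.018, 0.030]   # between-state variance draws
within = 0.21                                  # within-state variance (fixed here)
W = [0.5, 0.5]                                 # state population shares
vhat = [0.004, 0.006]                          # design-based sampling variances

avg_v = sum(w * v for w, v in zip(W, vhat))    # weighted avg sampling variance

# Subtract the sampling variance (floored at 0) before forming the ICC,
# once per draw, so the correction's uncertainty flows into the posterior.
icc_star = [max(b - avg_v, 0.0) / (max(b - avg_v, 0.0) + within)
            for b in draws_between]
print(icc_star)
```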
5. Implementation in Stan
5.1 Key Code Blocks
Data block defines inputs:
data {
  int<lower=1> N;                                    // Sample size
  int<lower=1> S;                                    // Number of states
  int<lower=1> J;                                    // Number of PSUs
  array[N] int<lower=0, upper=1> y;                  // Binary outcome
  array[N] int<lower=1, upper=S> state_id;           // State indicator
  array[N] real<lower=0> w_lik;                      // Likelihood weights
  array[S] real<lower=0, upper=1> w_state_pop_share; // Population shares
  array[S] real<lower=0> vhat_state;                 // Sampling variances
}

Model block specifies the pseudo-likelihood:
model {
  // Priors
  alpha ~ normal(prior_alpha_mean, prior_alpha_sd);
  sigma_state ~ normal(0, 1);
  sigma_psu ~ normal(0, 0.5);
  z_state ~ std_normal();
  z_psu ~ std_normal();

  // Pseudo-likelihood: each unit's log-likelihood contribution is
  // multiplied by its normalized survey weight
  for (i in 1:N) {
    target += w_norm[i] * bernoulli_logit_lpmf(y[i] | eta[i]);
  }
}

Generated quantities computes the three estimands at each iteration.
6. Shrinkage and Reliability
6.1 Bayesian Shrinkage
The multilevel structure induces partial pooling: domain estimates are pulled toward the grand mean. The shrinkage intensity depends on:
- Sample size $n_s$: small domains shrink more
- Sampling variance $\hat{v}_s$: noisy estimates shrink more
- Between-domain variance $\sigma^2_{\text{state}}$: low variance means more shrinkage (domains are similar)
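The classic precision-weighted shrinkage factor $\lambda_s = \sigma^2 / (\sigma^2 + v_s)$ captures all three dependencies. A sketch with toy numbers (the model's actual pooling is implicit in the posterior, so this is only the textbook approximation):

```python
# Sketch of the classic shrinkage factor lambda_s = sigma2 / (sigma2 + v_s):
# the domain estimate is a precision-weighted blend of its own data and
# the grand mean. Toy values, not bhfvar output.
sigma2_between = 0.02   # between-domain variance

def shrunk(p_hat, p_grand, v_s):
    lam = sigma2_between / (sigma2_between + v_s)  # weight on the domain's data
    return lam * p_hat + (1 - lam) * p_grand

p_grand = 0.40
print(shrunk(0.70, p_grand, 0.002))   # precise domain: stays near 0.67
print(shrunk(0.70, p_grand, 0.050))   # noisy domain: pulled toward 0.49
```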
7. Practical Recommendations
7.1 Sample Size Guidelines
| Scenario | Recommendation |
|---|---|
| Total N > 500, S > 20 | Standard settings work well |
| Small N or few domains | Consider tighter priors, more iterations |
| Very sparse domains (n_s < 5) | Results will be heavily shrunk; interpret cautiously |
Summary
The Bayesian Hybrid Framework addresses the fundamental challenges of variance decomposition in complex surveys:
| Challenge | Solution |
|---|---|
| Informative sampling | Bayesian Pseudo-Likelihood |
| Design artifacts | Hybrid GLMM with nuisance random effects |
| Finite-sample inflation | De-attenuation correction |
| Small domain samples | Bayesian shrinkage |
| Uncertainty quantification | Full posterior inference |
The result is design-consistent estimates of substantive domain heterogeneity, with appropriate uncertainty quantification and protection against both design confounding and sampling noise.
References
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H., & Rasbash, J. (1998). Weighting for unequal selection probabilities in multilevel models. Journal of the Royal Statistical Society: Series B, 60(1), 23–40.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A, 169(4), 805–827.
Savitsky, T. D., & Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 10(1), 1677–1708.
Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44(4), 1049–1060.
For questions or feedback, please visit the GitHub repository.