bhfvar: Bayesian Hybrid Framework for Variance Decomposition
JoonHo Lee
2026-01-24
Source: vignettes/introduction.Rmd
Introduction
The bhfvar package implements the Bayesian Hybrid Framework for variance decomposition in complex surveys with post-hoc (non-design-based) domains. When researchers use national surveys to study geographic variation—such as state-level differences in policy outcomes—they face a fundamental design-analysis mismatch: the survey was designed for national estimates, not for the sub-national comparisons being attempted.
This mismatch creates two critical challenges:
Informative Sampling: Complex survey designs (stratification, clustering) may be correlated with the outcome, confounding substantive domain heterogeneity with design artifacts.
Finite-Sample Variance Inflation: When domain sample sizes are small, sampling noise inflates apparent between-domain variance, conflating signal with noise.
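The second challenge is easy to demonstrate by simulation. In the base R sketch below (independent of bhfvar; all values are made up), every domain has the identical true rate, so the true between-domain variance is exactly zero, yet the naive between-domain variance converges to the average sampling variance p(1−p)/n:

```r
set.seed(42)
S <- 50        # number of domains (e.g., states)
n <- 5         # tiny sample per domain
p_true <- 0.30 # identical true rate everywhere: true between-domain variance is 0

# Naive between-domain variance of the direct estimates, averaged over replications
v_naive <- mean(replicate(2000, var(rbinom(S, n, p_true) / n)))

v_naive                     # close to 0.042 despite zero true variance
p_true * (1 - p_true) / n   # the average sampling variance it converges to
```

Every bit of that 0.042 is sampling noise masquerading as between-domain heterogeneity.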
The bhfvar package addresses both challenges through a unified Bayesian framework that combines design-based principles with model-based inference.
Installation
# From GitHub (development version)
devtools::install_github("joonho112/bhfvar")
The Core Problem: Signal vs. Noise
Motivating Example: Child Care Subsidy Receipt
Consider the National Survey of Early Care and Education (NSECE), which uses a complex multi-stage sampling design to achieve national representativeness. Suppose a researcher wants to estimate the proportion of home-based child care providers receiving subsidies in each U.S. state, and quantify how much this proportion varies across states.
A naive approach would:
- Compute the weighted proportion for each state
- Calculate the variance of these estimates across states
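The naive two-step calculation can be sketched in base R. The toy data and column names (state, y, w) below are illustrative placeholders, not bhfvar conventions:

```r
# Toy data: binary outcome y, survey weight w, state label
set.seed(1)
d <- data.frame(
  state = rep(c("AL", "CA", "TX"), each = 20),
  y     = rbinom(60, 1, 0.3),
  w     = runif(60, 0.5, 2)
)

# Step 1: weighted proportion for each state
p_hat <- sapply(split(d, d$state), function(s) weighted.mean(s$y, s$w))

# Step 2: variance of these point estimates across states
v_between_naive <- var(p_hat)
```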
However, this approach faces two problems:
Problem 1: Design Artifacts
# Illustration of the problem (conceptual)
# State A might have high subsidy rates NOT because of state policy,
# but because it happens to contain PSUs from high-subsidy strata
# The observed variance mixes:
# - True state effects (policy, economy, demographics)
# - Stratum effects (urban/rural, region)
# - PSU effects (local labor markets, provider networks)
Problem 2: Sampling Noise
# A state with only 5 sampled providers will have a noisy estimate
# This noise gets counted as "between-state variance"
# E[Var_between_naive] ≈ Var_between_true + Σ_s π_s × V_s
# The second term (the π_s-weighted sum of the domains'
# sampling variances V_s) inflates the naive estimate!
The Bayesian Hybrid Framework Solution
Key Innovations
The bhfvar package implements several key methodological innovations:
Bayesian Pseudo-Likelihood (BPL): Incorporates survey weights into the likelihood to achieve design-consistent inference while maintaining the benefits of Bayesian multilevel modeling.
Hybrid GLMM Structure: Simultaneously models substantive domain effects (states) and nuisance design effects (strata, PSUs), correctly partitioning variance attributable to each source.
Dual Estimand Framework: Distinguishes between:
- Estimand A (Policy): Substantive variance net of design artifacts
- Estimand B (Descriptive): Total observed variance
- Estimand A* (De-attenuated): Policy variance corrected for finite-sample inflation
Integrated De-attenuation: Both implicit (via Bayesian shrinkage) and explicit (method-of-moments correction) approaches to remove sampling noise from variance estimates.
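A minimal sketch of the two de-attenuation routes, using generic textbook formulas rather than bhfvar's internal API (all numbers are made up): the explicit route subtracts the average sampling variance from the naive between-domain variance, and the implicit route shrinks each noisy direct estimate toward the grand mean by its reliability.

```r
# Direct estimates for 4 small domains, with their sampling variances V_s
p_hat <- c(0.22, 0.45, 0.18, 0.38)
V_s   <- c(0.012, 0.020, 0.015, 0.010)
pi_s  <- rep(0.25, 4)   # population shares

# Explicit (method-of-moments): subtract the average sampling variance
v_naive <- var(p_hat)
v_true  <- max(0, v_naive - sum(pi_s * V_s))

# Implicit (shrinkage): reliability lambda_s = v_true / (v_true + V_s)
lambda   <- v_true / (v_true + V_s)
p_shrunk <- lambda * p_hat + (1 - lambda) * mean(p_hat)
```

Domains with large V_s get low reliability and are pulled strongly toward the grand mean, which is the Bayesian shrinkage effect the package obtains automatically from the posterior.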
Package Capabilities
The bhfvar package provides a complete workflow for variance decomposition:
1. Model Compilation and Data Preparation
library(bhfvar)
# Step 1: Compile the Stan model (once per session)
model <- compile_bhf_model()
# Step 2: Prepare your data for Stan
data(bhf_synthetic_data) # Example dataset
prepared <- prepare_bhf_data(
  data    = bhf_synthetic_data,
  outcome = "has_subsidy",
  domain  = "state",
  strata  = "stratum",
  psu     = "psu",
  weights = "weight"
)
print(prepared)
2. Model Fitting
The model-fitting step (Step 3), which produces the fitted model object (fit) used in the code below, is omitted from this rendered page; see the Complete Workflow vignette for the full fitting call.
3. Variance Decomposition
# Step 4: Extract variance decomposition across all three estimands
vd <- variance_decomposition(fit)
# The output shows:
# - Estimand A: ICC on logit scale (policy estimand)
# - Estimand B: ICC on probability scale (descriptive estimand)
# - Estimand A*: De-attenuated ICC (policy adjusted)
4. Domain Estimates
# Step 5: Get domain-specific estimates with shrinkage
estimates <- domain_estimates(fit, type = "marginal")
# Each domain gets:
# - Posterior mean and credible interval
# - Reliability (shrinkage factor)
# - Population share
Understanding the Three Estimands
The package computes three distinct variance decomposition estimands:
Estimand A: Policy (Logit Scale)
This estimand operates on the latent logit scale and isolates substantive state variation by factoring out design effects during model estimation.
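One common way to form a logit-scale ICC divides the domain variance component by the total latent variance, using π²/3 as the logistic residual variance. The sketch below uses this generic formula with made-up variance components; it is not bhfvar output, and the exact denominator the package uses may differ:

```r
# Hypothetical latent variance components on the logit scale
sigma2_state   <- 0.35   # substantive domain variance
sigma2_stratum <- 0.20   # nuisance design variance (strata)
sigma2_psu     <- 0.55   # nuisance design variance (PSUs)

# Logit-scale ICC for the domain (state) level
icc_A <- sigma2_state /
  (sigma2_state + sigma2_stratum + sigma2_psu + pi^2 / 3)
icc_A   # roughly 0.08
```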
Estimand B: Descriptive (Probability Scale)
This estimand captures total observed variance on the probability scale, representing what a researcher would see in the data.
Estimand A*: Policy Adjusted (De-attenuated)
Var_A* = Var_B − V̄, where V̄ = Σ_s π_s × V_s is the population-weighted average sampling variance. This estimand removes finite-sample inflation from the descriptive variance.
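Taking A* to be the descriptive between-domain variance minus the population-weighted average of the domain sampling variances, the correction is simple arithmetic. The numbers below are invented for illustration:

```r
# Descriptive between-domain variance (Estimand B, probability scale)
sigma2_B <- 0.0090

# Domain-level sampling variances and population shares
V_s  <- c(0.0070, 0.0110, 0.0065, 0.0085)
pi_s <- c(0.40, 0.15, 0.30, 0.15)

V_bar        <- sum(pi_s * V_s)          # population-weighted average sampling variance
sigma2_Astar <- max(0, sigma2_B - V_bar) # de-attenuated between-domain variance
```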
Interpreting the Differences
| Estimand | Question Answered | When to Use |
|---|---|---|
| A | How much do latent propensities vary across states? | Theoretical modeling |
| B | How much do observed rates vary across states? | Describing the data landscape |
| A* | How much substantive variation exists after removing noise? | Policy evaluation |
The gap between B and A* reveals how much of the apparent geographic variation is actually attributable to sampling noise rather than true differences.
Typical Results Pattern
In our motivating application (NSECE data on child care subsidy receipt), we found:
# Illustrative results (from simulation studies)
#
# Estimand A (logit): ICC = 0.078 [0.032, 0.146]
# Estimand B (prob): ICC = 0.042 [0.019, 0.076]
# Estimand A* (de-atten): ICC = 0.006 [0.005, 0.006]
#
# Interpretation:
# - The observed between-state variation (B) is substantial
# - But most of it is sampling noise, not substantive differences
# - After de-attenuation (A*), only ~0.6% of total variance is between states
# - This is a 7-fold reduction from the naive estimate!
This pattern—where de-attenuation reveals much smaller substantive heterogeneity than naively apparent—is common when domains have small sample sizes and high sampling variance.
When to Use bhfvar
Recommended Use Cases
The bhfvar package is particularly valuable for:
Sub-national analysis of national surveys: When using surveys like NSECE, NHES, or ECLS to study state-level variation
Domain estimation with small samples: When some domains have few observations, making direct estimation unreliable
Variance decomposition with complex designs: When you need to separate substantive variation from design artifacts
Policy evaluation requiring geographic comparisons: When the question is whether outcomes truly differ across administrative units
When Other Approaches May Be Appropriate
Consider alternatives when:
Domains are design-based: If the survey was designed with your domains as primary sampling units, standard methods may suffice
All domains have large samples: With 100+ observations per domain, direct estimation becomes reliable
Design is simple random sampling: Without stratification or clustering, the design-based complications disappear
Road Map to the Vignettes
The bhfvar package includes comprehensive documentation organized into two tracks:
Applied Researchers Track
For users who want to apply the package effectively:
| Vignette | Purpose | Reading Time |
|---|---|---|
| Quick Start | Your first variance decomposition in 5 minutes | 5 min |
| Complete Workflow | End-to-end analysis with real data | 30 min |
| Diagnostics | Convergence checking and model validation | 20 min |
Methodological Researchers Track
For users interested in the mathematical foundations:
| Vignette | Purpose | Reading Time |
|---|---|---|
| Methodology | The Bayesian Hybrid Framework in detail | 45 min |
| Dual Estimands | Understanding A, B, and A* | 30 min |
| API Reference | Complete function documentation | Reference |
Summary
The bhfvar package addresses a fundamental challenge in geographic policy analysis: how to correctly decompose variance when using complex surveys for sub-national inference. Key features include:
Design-consistent inference through Bayesian Pseudo-Likelihood
Correct variance partitioning via the Hybrid GLMM structure
Multiple estimands for different research questions (policy vs. descriptive)
Automatic de-attenuation to remove finite-sample variance inflation
Domain-specific estimates with appropriate uncertainty quantification
The package is especially valuable in low-information settings—where domain sample sizes are small and sampling variance is high—because it reveals how much of the apparent geographic variation is true signal versus noise.
References
Lee, J., & Hooper, A. (2025). Disentangling signal from noise: A Bayesian hybrid framework for variance decomposition in complex surveys with post-hoc domains. Mathematics (under review).
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society: Series A, 169(4), 805–827.
Savitsky, T. D., & Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 10(1), 1677–1708.
Zeger, S. L., Liang, K.-Y., & Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44(4), 1049–1060.
For questions or feedback about this package, please visit the GitHub repository.