Transforms survey data into the format required by the BHF Stan model. This function handles index recoding, weight scaling, and computation of design effect estimates needed for de-attenuation.

Usage

prepare_bhf_data(
  data,
  outcome,
  domain,
  strata,
  psu,
  weights,
  population_shares = NULL,
  use_deattenuation = TRUE,
  prior_alpha_mean = NULL,
  prior_alpha_sd = 1.5
)

Arguments

data

A data frame containing the survey data.

outcome

Character string. Name of the binary outcome variable (0/1).

domain

Character string. Name of the domain/state variable.

strata

Character string. Name of the stratification variable.

psu

Character string. Name of the PSU (primary sampling unit) variable.

weights

Character string. Name of the survey weight variable.

population_shares

Optional numeric vector of length S (number of domains) containing population shares for each domain. Must sum to 1. If NULL, shares are estimated from the weighted data.

use_deattenuation

Logical. If TRUE (default), computes and applies a de-attenuation adjustment for finite-sample variance inflation.

prior_alpha_mean

Numeric. Prior mean for the intercept on the logit scale. If NULL (default), the prior mean is estimated from the data.

prior_alpha_sd

Numeric. Prior SD for the intercept. Default is 1.5.

Value

An object of class bhf_data containing:

stan_data

List of data formatted for Stan

mapping

List containing domain/strata/PSU label mappings

domain_summary

Data frame with domain-level summary statistics

input_info

List recording input column names and settings

Details

This function performs several critical transformations:

Index Recoding

All grouping variables are recoded to consecutive integers starting from 1 (required by Stan).

Weight Scaling

Weights are scaled using Method D2 from Pfeffermann et al. (1998) so that they sum to the effective sample size within each domain. This is critical for proper pseudo-likelihood estimation.

Sampling Variance Estimation

For each domain, estimates the sampling variance of the proportion using the design effect.

PSU Structure

Creates the nested PSU-within-stratum structure required by the Stan model.
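
The index recoding and PSU nesting described above can be illustrated with a small base R sketch. This is not the package's internal code: `recode_index()`, the toy data frame, and the `"::"` key separator are assumptions made for illustration, and the actual layout of `stan_data` may differ.

```r
# Hypothetical helper (not part of the package): labels -> integers 1..K
recode_index <- function(x) {
  labels <- unique(x)                      # distinct labels in order of appearance
  list(index = match(x, labels),           # integer codes in 1..K
       labels = labels)                    # lookup table: code -> original label
}

# Toy data: PSU labels restart within each stratum
df <- data.frame(
  state   = c("TX", "TX", "TX", "CA", "CA"),
  stratum = c("A", "A", "B", "C", "C"),
  psu     = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

domain  <- recode_index(df$state)
stratum <- recode_index(df$stratum)

# A PSU is identified by the (stratum, psu) pair, not by its label alone
psu <- recode_index(paste(df$stratum, df$psu, sep = "::"))

# Stratum index of each distinct PSU, in PSU-index order
psu_stratum <- stratum$index[match(seq_along(psu$labels), psu$index)]

domain$index   # 1 1 1 2 2
psu$index      # 1 2 3 4 4
psu_stratum    # 1 1 2 3
```

Keeping the `labels` vectors alongside the integer codes is what makes it possible to map model output back to the original domain, stratum, and PSU names.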

Weight Scaling (Method D2)

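Weights are scaled so that for each unit i in domain s: $$w^*_i = w_i \times \frac{n^{eff}_s}{\sum_{j \in s} w_j}$$ where \(n^{eff}_s = (\sum_{j \in s} w_j)^2 / \sum_{j \in s} w_j^2\) is the effective sample size of domain s. With this scaling the weights in domain s sum to \(n^{eff}_s\), so each domain's pseudo-likelihood contribution reflects its effective rather than its nominal sample size.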
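
As a check on the formula, the scaling can be reproduced in a few lines of base R. This is a minimal sketch under the definitions above, not the package's implementation; the function name `scale_weights_d2()` and the toy inputs are hypothetical.

```r
# Hypothetical helper: scale weights so they sum to n_eff within each domain
scale_weights_d2 <- function(w, domain) {
  sum_w  <- ave(w,   domain, FUN = sum)   # per-unit copy of the domain weight total
  sum_w2 <- ave(w^2, domain, FUN = sum)   # per-unit copy of the domain sum of squares
  n_eff  <- sum_w^2 / sum_w2              # effective sample size of the unit's domain
  w * n_eff / sum_w                       # scaled weight w*_i
}

w      <- c(1, 1, 2, 4, 4)
domain <- c("a", "a", "a", "b", "b")

w_star <- scale_weights_d2(w, domain)

# Scaled weights sum to the effective sample size in each domain:
# domain "a": 4^2 / 6  = 2.667, domain "b": 8^2 / 32 = 2
tapply(w_star, domain, sum)
```

In domain "b" the weights are equal, so the effective sample size equals the actual count (2) and every scaled weight is 1. In domain "a" the unequal weights reduce the effective sample size below 3.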

Sampling Variance Estimation

The estimated sampling variance for domain s is: $$\hat{V}_s = \frac{deff_s \times \hat{p}_s(1-\hat{p}_s)}{n_s}$$ where \(deff_s\) is the design effect, \(\hat{p}_s\) is the estimated proportion, and \(n_s\) is the sample size in domain s. A default value of 1.5 is used for \(deff_s\) when the design effect cannot be reliably estimated.
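
A short numeric example may help show the scale of these variances. The function below is a hypothetical sketch of the formula only; how the package estimates the design effect, or decides that it is unreliable, is not shown here.

```r
# Hypothetical helper: design-effect-inflated binomial variance of a proportion
sampling_variance <- function(p_hat, n, deff = 1.5) {
  deff * p_hat * (1 - p_hat) / n
}

# Domain with estimated proportion 0.2 and 400 observations, fallback deff = 1.5
v_default <- sampling_variance(p_hat = 0.2, n = 400)             # 1.5 * 0.16 / 400 = 6e-04

# Same domain with an estimated design effect of 2
v_deff2   <- sampling_variance(p_hat = 0.2, n = 400, deff = 2)   # 2 * 0.16 / 400 = 8e-04

sqrt(v_default)  # standard error, about 0.0245
```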

Examples

if (FALSE) { # \dontrun{
# Load example data
data(bhf_synthetic_data)

# Prepare data for Stan
prepared <- prepare_bhf_data(
  data = bhf_synthetic_data,
  outcome = "has_subsidy",
  domain = "state",
  strata = "stratum",
  psu = "psu",
  weights = "weight"
)

# Inspect the result
print(prepared)
summary(prepared)
} # }