Principled Prior Elicitation for Dirichlet Process Mixture Models
The DPprior package provides tools for eliciting Gamma hyperpriors on the concentration parameter α in Dirichlet Process (DP) mixture models. Rather than requiring researchers to reason about the abstract parameter α, DPprior allows specification through intuitive quantities:
- Expected cluster counts: “How many distinct groups do you anticipate?”
- Weight concentration: “How evenly do you expect observations to be distributed across clusters?”
These natural questions are translated into principled Gamma(a, b) hyperpriors using computationally efficient algorithms backed by exact moment matching.
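To make the link between a Gamma(a, b) hyperprior and the implied cluster count concrete, here is a minimal base-R sketch that computes the prior mean and variance of K by numerical integration. It relies on the standard DP results for the moments of K given α. The function names and the values a = 2, b = 1.2 are illustrative only and are not part of DPprior, and treating b as a rate parameter is an assumption.

```r
# Conditional moments of the number of clusters K given alpha, for J observations
# (standard Dirichlet process results).
mean_K_given_alpha <- function(alpha, J) {
  i <- seq_len(J)
  sum(alpha / (alpha + i - 1))
}

var_K_given_alpha <- function(alpha, J) {
  i <- seq_len(J)
  sum(alpha * (i - 1) / (alpha + i - 1)^2)
}

# Marginal mean and variance of K when alpha ~ Gamma(shape = a, rate = b).
# ASSUMPTION: b is a rate parameter; these helpers are hypothetical, not DPprior functions.
prior_moments_K <- function(J, a, b) {
  expect <- function(g) {
    integrand <- function(alpha) {
      vapply(alpha, g, numeric(1)) * dgamma(alpha, shape = a, rate = b)
    }
    integrate(integrand, lower = 0, upper = Inf)$value
  }
  m1 <- expect(function(x) mean_K_given_alpha(x, J))
  # E[K^2 | alpha] = Var(K | alpha) + E[K | alpha]^2
  m2 <- expect(function(x) var_K_given_alpha(x, J) + mean_K_given_alpha(x, J)^2)
  c(mean = m1, var = m2 - m1^2)
}

# Illustrative call: implied moments of K for 50 sites
prior_moments_K(J = 50, a = 2, b = 1.2)
```

Elicitation runs this mapping in reverse: given target values for E[K] and Var(K), it searches for the (a, b) pair that reproduces them.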
Installation
# Install from CRAN (when available)
install.packages("DPprior")
# Or install the development version from GitHub
# install.packages("devtools")
devtools::install_github("joonho112/DPprior")
Quick Start
library(DPprior)
# Scenario: 50-site multisite trial, expecting ~5 distinct effect patterns
fit <- DPprior_fit(
  J = 50,                # Number of sites
  mu_K = 5,              # Expected clusters
  confidence = "medium"  # Moderate uncertainty
)
# View the elicited prior
print(fit)
#> DPprior Elicitation Results
#> ──────────────────────────────────────────────────────────────
#> Prior: α ~ Gamma(1.892, 1.201)
#> Target: E[K] = 5.00, Var(K) = 12.50
#> Achieved: E[K] = 5.00, Var(K) = 12.50
#> Method: A2-MN (converged in 3 iterations)
# Visualize the prior
plot(fit)
Key Features
1. Intuitive Elicitation
Specify priors through expected cluster counts and uncertainty levels:
# Using confidence levels (recommended for most users)
fit <- DPprior_fit(J = 50, mu_K = 5, confidence = "medium")
# Or specify variance directly
fit <- DPprior_fit(J = 50, mu_K = 5, var_K = 10)
2. Dual-Anchor Control
Go beyond cluster counts to control weight behavior, addressing the “unintended prior” problem (Vicentini & Jermyn, 2025):
# First, fit K-based prior
fit_K <- DPprior_fit(J = 50, mu_K = 5, var_K = 8)
# Check if largest cluster might dominate
prob_w1_exceeds(0.5, fit_K$a, fit_K$b)
#> [1] 0.52 # 52% chance one cluster has >50% of observations
# Apply dual-anchor constraint
w1_target <- list(prob = list(threshold = 0.5, value = 0.30))
fit_dual <- DPprior_dual(fit_K, w1_target, lambda = 0.5)
# Verify improvement
prob_w1_exceeds(0.5, fit_dual$a, fit_dual$b)
#> [1] 0.31 # Now only 31%
3. Comprehensive Diagnostics
Verify your prior behaves as intended across all relevant dimensions:
- K distribution (cluster counts)
- w₁ distribution (largest cluster weight)
- ρ distribution (co-clustering probability)
- α distribution (concentration parameter)
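Two of these quantities can be approximated by hand in base R, which may help when sanity-checking a prior. The sketch below assumes w₁ | α ~ Beta(1, α) (the first stick-breaking weight), ρ | α = 1 / (1 + α), and α ~ Gamma(a, b) with b as a rate. The helper names are hypothetical and are not DPprior functions; the package's own diagnostics may be computed differently.

```r
# P(w1 > t) under alpha ~ Gamma(shape = a, rate = b).
# ASSUMPTION: w1 | alpha ~ Beta(1, alpha), so P(w1 > t | alpha) = (1 - t)^alpha.
# Averaging over the Gamma prior uses its moment generating function at s = log(1 - t).
tail_prob_w1 <- function(t, a, b) {
  (b / (b - log(1 - t)))^a
}

# Prior mean of the co-clustering probability.
# ASSUMPTION: rho | alpha = 1 / (1 + alpha); the expectation is taken numerically.
mean_rho <- function(a, b) {
  integrate(
    function(alpha) dgamma(alpha, shape = a, rate = b) / (1 + alpha),
    lower = 0, upper = Inf
  )$value
}

# Illustrative values only
tail_prob_w1(0.5, a = 2, b = 1)
mean_rho(a = 2, b = 1)
```

A large tail probability for w₁ suggests that the prior puts substantial mass on one cluster dominating, even when the implied cluster count looks reasonable.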
fit <- DPprior_fit(J = 50, mu_K = 5, check_diagnostics = TRUE)
plot(fit) # Four-panel diagnostic dashboard
summary(fit) # Detailed numerical diagnostics
4. Fast Computation
The package implements the DORO 2.0 methodology:
- A1 (Closed-form): Instant initial estimates using Negative Binomial approximation
- A2 (Newton refinement): Exact moment matching in 2-4 iterations
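As a rough illustration of the closed-form step, the sketch below uses one common form of the Negative Binomial approximation: K − 1 | α is treated as Poisson(α log J), so that under α ~ Gamma(a, b) the count K − 1 is Negative Binomial with mean (a / b) log J and variance mean + mean² / a. Solving those two equations for a and b gives an initial estimate. Whether this matches the package's A1 step exactly is an assumption, and `a1_init` is a hypothetical name, not a DPprior function.

```r
# Closed-form initial estimate of Gamma(a, b) from target moments of K.
# ASSUMPTION: K - 1 | alpha ~ Poisson(alpha * log(J)), with b as a rate parameter.
a1_init <- function(J, mu_K, var_K) {
  m <- mu_K - 1        # target mean of K - 1
  c_J <- log(J)        # Poisson scaling constant
  excess <- var_K - m  # overdispersion relative to a Poisson
  if (excess <= 0) {
    stop("var_K must exceed mu_K - 1 for a Negative Binomial match")
  }
  a <- m^2 / excess    # from var = m + m^2 / a
  b <- a * c_J / m     # from mean = (a / b) * log(J)
  c(a = a, b = b)
}

# Illustrative call with the same targets as the examples in this section
a1_init(J = 50, mu_K = 5, var_K = 10)
```

Because the Poisson step is only an approximation, the resulting (a, b) will generally not reproduce the targets exactly, which is why a refinement stage follows.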
# A1 only (fastest, approximate)
fit_fast <- DPprior_fit(J = 50, mu_K = 5, var_K = 10, method = "A1")
# A2 with Newton refinement (default, exact)
fit_exact <- DPprior_fit(J = 50, mu_K = 5, var_K = 10, method = "A2-MN")
When to Use DPprior
DPprior is particularly valuable for:
- Multisite randomized trials with moderate numbers of sites (J = 20–200)
- Meta-analysis with flexible heterogeneity modeling
- Bayesian nonparametric density estimation in small-to-moderate samples
- Low-information settings where the prior on α substantially influences posterior inference (Lee et al., 2025)
Vignettes
The package includes comprehensive documentation:
Applied Researchers Track
| Vignette | Description |
|---|---|
| Introduction | Why prior elicitation matters |
| Quick Start | Your first prior in 5 minutes |
| Applied Guide | Complete elicitation workflow |
| Dual-Anchor | Control cluster counts AND weights |
| Diagnostics | Verify your prior behaves as intended |
| Case Studies | Real-world applications |
Methodological Researchers Track
| Vignette | Description |
|---|---|
| Theory Overview | Mathematical foundations |
| Stirling Numbers | Antoniak distribution details |
| Approximations | A1 closed-form theory |
| Newton Algorithm | A2 exact moment matching |
| Weight Distributions | w₁, ρ, and dual-anchor |
| API Reference | Complete function documentation |
Citation
If you use DPprior in your research, please cite:
@Manual{DPprior2025,
title = {{DPprior}: Principled Prior Elicitation for {Dirichlet} Process Mixture Models},
author = {JoonHo Lee},
year = {2025},
note = {R package version 1.0.0},
url = {https://github.com/joonho112/DPprior},
}
@Article{Lee2025multisite,
title = {Improving the Estimation of Site-Specific Effects and Their Distribution in Multisite Trials},
author = {JoonHo Lee and Jonathan Che and Sophia Rabe-Hesketh and Avi Feller and Luke Miratrix},
journal = {Journal of Educational and Behavioral Statistics},
year = {2025},
volume = {50},
number = {5},
pages = {731--764},
doi = {10.3102/10769986241254286},
}
Related Work
This package builds on methodological foundations from:
- Dorazio (2009): Original DORO approach for K-based elicitation
- Lee et al. (2025): Informative priors via χ² distribution on K
- Vicentini & Jermyn (2025): Sample-size-independent approaches and weight-based elicitation
- Zito et al. (2024): Stirling-gamma priors and negative binomial approximation
Support
This project was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D240078 to the University of Alabama.
The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
References
Dorazio, R. M. (2009). On selecting a prior for the precision parameter of Dirichlet process mixture models. Journal of Statistical Planning and Inference, 139(10), 3384–3390.
Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764.
Vicentini, C., & Jermyn, I. H. (2025). Prior selection for the precision parameter of Dirichlet process mixtures. arXiv:2502.00864.
Zito, A., Rigon, T., & Dunson, D. B. (2024). Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors. arXiv:2306.02360.
License
MIT © JoonHo Lee