Principled Prior Elicitation for Dirichlet Process Mixture Models

The DPprior package provides tools for eliciting Gamma hyperpriors on the concentration parameter α in Dirichlet Process (DP) mixture models. Rather than requiring researchers to reason about the abstract parameter α, DPprior allows specification through intuitive quantities:

  • Expected cluster counts: “How many distinct groups do you anticipate?”
  • Weight concentration: “How evenly do you expect observations to be distributed across clusters?”

These natural questions are translated into principled Gamma(a, b) hyperpriors using computationally efficient algorithms backed by exact moment matching.

Installation

# Install from GitHub
# install.packages("devtools")
devtools::install_github("joonho112/DPprior")

# Install from CRAN, if available
# install.packages("DPprior")

Quick Start

library(DPprior)

# Scenario: 50-site multisite trial, expecting ~5 distinct effect patterns
fit <- DPprior_fit(
  J = 50,                 # Number of sites
  mu_K = 5,               # Expected clusters
  confidence = "medium",  # Moderate uncertainty
  warn_dominance = FALSE  # Keep quick-start output concise
)

# View the elicited prior
print(fit)
#> DPprior Prior Elicitation Result
#> =============================================
#>
#> Gamma Hyperprior: α ~ Gamma(a = 1.4082, b = 1.0770)
#>   E[α] = 1.308, SD[α] = 1.102
#>
#> Target (J = 50):
#>   E[K_J]   = 5.00
#>   Var(K_J) = 10.00
#>   (from confidence = 'medium')
#>
#> Achieved:
#>   E[K_J] = 5.000000, Var(K_J) = 10.000000
#>   Residual = 3.94e-10
#>
#> Method: A2-MN (7 iterations)
#>
#> Dominance Risk: HIGH ✘ (P(w₁>0.5) = 50%)

# Visualize the prior
plot(fit)
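The reported moments can be cross-checked without the package. The sketch below is only an illustration under stated assumptions: it uses the standard DP result that K_J given α is a sum of independent Bernoulli draws with success probabilities α/(α + i - 1), and it treats Gamma(a, b) as shape-rate (consistent with E[α] = a/b in the printed output). The helper names are hypothetical and are not part of DPprior.

```r
# Cross-check of the quick-start output using base R only.
# Assumptions (not taken from the package source): K_J | alpha is a sum of
# independent Bernoulli(alpha / (alpha + i - 1)) draws for i = 1..J, and
# Gamma(a, b) is parameterized by shape and rate. Helper names are hypothetical.

J <- 50
a <- 1.4082  # shape, copied from the printed result
b <- 1.0770  # rate, copied from the printed result

# Conditional mean and variance of K_J given alpha (vectorized over alpha)
cond_mean_K <- function(alpha) {
  vapply(alpha, function(x) sum(x / (x + 0:(J - 1))), numeric(1))
}
cond_var_K <- function(alpha) {
  vapply(alpha, function(x) {
    p <- x / (x + 0:(J - 1))
    sum(p * (1 - p))
  }, numeric(1))
}

# Average a function of alpha over the Gamma(a, b) hyperprior
prior_mean <- function(g) {
  integrate(function(x) g(x) * dgamma(x, shape = a, rate = b), 0, Inf)$value
}

mean_K <- prior_mean(cond_mean_K)
# Law of total variance: Var(K) = E[Var(K | alpha)] + Var(E[K | alpha])
var_K <- prior_mean(cond_var_K) +
  prior_mean(function(x) cond_mean_K(x)^2) - mean_K^2

round(c(mean_K = mean_K, var_K = var_K), 2)
```

If the assumptions hold, both values should land close to the targets of 5 and 10 shown in the printed result.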

Key Features

1. Intuitive Elicitation

Specify priors through expected cluster counts and uncertainty levels:

# Using confidence levels (recommended for most users)
fit <- DPprior_fit(J = 50, mu_K = 5, confidence = "medium")

# Or specify variance directly
fit <- DPprior_fit(J = 50, mu_K = 5, var_K = 10)

2. Dual-Anchor Control

Go beyond cluster counts to control the behavior of the first stick-breaking weight, which addresses the “unintended prior” problem (Vicentini & Jermyn, 2025):

# First, fit K-based prior
fit_K <- DPprior_fit(J = 50, mu_K = 5, var_K = 8, warn_dominance = FALSE)

# Check whether the first size-biased cluster might dominate
prob_w1_exceeds(0.5, fit_K$a, fit_K$b)
#> [1] 0.481478

# Apply dual-anchor constraint
w1_target <- list(prob = list(threshold = 0.5, value = 0.30))
fit_dual <- DPprior_dual(fit_K, w1_target, lambda = 0.5)

# Verify the trade-off
prob_w1_exceeds(0.5, fit_dual$a, fit_dual$b)
#> [1] 0.4379077
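
The dominance probability has a simple closed form if one assumes w₁ | α ~ Beta(1, α), so that P(w₁ > t | α) = (1 - t)^α, and averages over a shape-rate Gamma(a, b) prior using the Gamma moment generating function. The sketch below works under those assumptions; `dominance_prob` is a hypothetical helper, not the package's implementation of `prob_w1_exceeds`.

```r
# Closed-form dominance probability under stated assumptions:
#   w1 | alpha ~ Beta(1, alpha)     =>  P(w1 > t | alpha) = (1 - t)^alpha
#   alpha ~ Gamma(shape a, rate b)  =>  E[(1 - t)^alpha] = (1 + s / b)^(-a),
#   where s = -log(1 - t)
# `dominance_prob` is a hypothetical helper, not a DPprior function.
dominance_prob <- function(t, a, b) {
  stopifnot(t > 0, t < 1, a > 0, b > 0)
  s <- -log1p(-t)
  (1 + s / b)^(-a)
}

# Quick-start prior Gamma(1.4082, 1.0770): compare with the reported
# "P(w1 > 0.5) = 50%" dominance diagnostic
p_closed <- dominance_prob(0.5, a = 1.4082, b = 1.0770)

# Numerical check of the same quantity by direct integration over alpha
p_numeric <- integrate(
  function(x) 0.5^x * dgamma(x, shape = 1.4082, rate = 1.0770),
  lower = 0, upper = Inf
)$value

round(c(closed_form = p_closed, numeric = p_numeric), 4)
```

Because the expression is monotone in both a and b, it also makes clear why lowering the dominance probability requires shifting prior mass toward larger α, which in turn pulls the implied cluster count upward. That tension is the trade-off the dual-anchor step balances.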

3. Comprehensive Diagnostics

Verify that your prior behaves as intended across all relevant dimensions:

  • K distribution (cluster counts)
  • w₁ distribution (first stick-breaking / size-biased cluster weight)
  • ρ distribution (co-clustering probability)
  • α distribution (concentration parameter)

fit <- DPprior_fit(J = 50, mu_K = 5, check_diagnostics = TRUE)
plot(fit)  # Four-panel diagnostic dashboard
summary(fit)  # Detailed numerical diagnostics
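
For intuition about what the four diagnostic quantities look like, they can be simulated directly. This is a rough Monte Carlo sketch in base R, not the package's diagnostic code. It assumes a shape-rate Gamma prior, K_J | α as a sum of independent Bernoulli(α/(α + i - 1)) draws, w₁ | α ~ Beta(1, α), and ρ | α = 1/(1 + α) (the probability that two observations share a cluster). DPprior may define ρ differently.

```r
# Monte Carlo look at the four diagnostic quantities, base R only.
# All distributional choices below are assumptions stated in the lead-in,
# not taken from the DPprior source.
set.seed(1)
J <- 50
a <- 1.4082  # shape of the quick-start prior
b <- 1.0770  # rate of the quick-start prior
n_draws <- 20000

# alpha from the Gamma hyperprior
alpha <- rgamma(n_draws, shape = a, rate = b)

# K_J | alpha: number of "new cluster" events among J sequential draws
K <- vapply(alpha, function(x) {
  sum(rbinom(J, size = 1, prob = x / (x + 0:(J - 1))))
}, numeric(1))

# First stick-breaking weight and co-clustering probability
w1 <- rbeta(n_draws, 1, alpha)
rho <- 1 / (1 + alpha)

diagnostics <- c(
  mean_K = mean(K),
  var_K = var(K),
  p_w1_gt_half = mean(w1 > 0.5),
  mean_rho = mean(rho),
  mean_alpha = mean(alpha)
)
round(diagnostics, 3)
```

Histograms of `K`, `w1`, `rho`, and `alpha` from this simulation give a crude stand-in for a four-panel view of the prior.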

4. Fast Computation

The package implements the Design-Conditional Elicitation (DCE) methodology via Two-Stage Moment Matching (TSMM):

  • A1 (Closed-form): Instant initial estimates using Negative Binomial approximation
  • A2 (Newton refinement): Exact moment matching via damped Newton iterations

# A1 only (fastest, approximate)
fit_fast <- DPprior_fit(J = 50, mu_K = 5, var_K = 10, method = "A1")

# A2 with Newton refinement (default, exact)
fit_exact <- DPprior_fit(J = 50, mu_K = 5, var_K = 10, method = "A2-MN")
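
To illustrate how a closed-form starting value can arise, the sketch below uses one common Negative Binomial construction: approximate K_J - 1 | α by Poisson(α log J), so that mixing over a shape-rate Gamma(a, b) yields a Negative Binomial whose mean and variance can be inverted for (a, b). This is an assumption made for illustration. The package's A1 step may use a different scaling constant or correction, and `nb_start` is a hypothetical helper.

```r
# Closed-form moment matching under a Poisson-Gamma (Negative Binomial)
# approximation. Assumption: K_J - 1 | alpha ~ Poisson(alpha * log(J)), so
# with alpha ~ Gamma(shape a, rate b) and c = log(J):
#   E[K_J]   = 1 + (a / b) * c
#   Var(K_J) = (a / b) * c + (a / b^2) * c^2
# `nb_start` is a hypothetical helper; DPprior's A1 may differ in detail.
nb_start <- function(J, mu_K, var_K) {
  c_J <- log(J)
  m <- mu_K - 1  # mean of the shifted count K_J - 1
  # A Negative Binomial requires overdispersion relative to the Poisson
  stopifnot(m > 0, var_K > m)
  a <- m^2 / (var_K - m)
  b <- m * c_J / (var_K - m)
  list(a = a, b = b)
}

start <- nb_start(J = 50, mu_K = 5, var_K = 10)

# Under the approximation, the targets are reproduced by construction
c_J <- log(50)
approx_mean <- 1 + start$a / start$b * c_J
approx_var <- start$a / start$b * c_J + start$a / start$b^2 * c_J^2

round(c(a = start$a, b = start$b, mean = approx_mean, var = approx_var), 4)
```

Starting values obtained this way need not match the refined Gamma(1.4082, 1.0770) reported in the Quick Start, because the Poisson approximation is not exact at finite J. Exact moment matching is what removes that approximation error.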

When to Use DPprior

DPprior is particularly valuable for:

  • Multisite randomized trials with moderate numbers of sites (J = 20–200)
  • Meta-analysis with flexible heterogeneity modeling
  • Bayesian nonparametric density estimation in small-to-moderate samples
  • Low-information settings where the prior on α substantially influences posterior inference (Lee et al., 2025)

Vignettes

The package documentation is organized into two vignette tracks:

Applied Researchers Track

Vignette        Description
Introduction    Why prior elicitation matters
Quick Start     Your first prior in 5 minutes
Applied Guide   Complete elicitation workflow
Dual-Anchor     Control cluster counts AND weights
Diagnostics     Verify your prior behaves as intended
Case Studies    Real-world applications

Methodological Researchers Track

Vignette               Description
Theory Overview        Mathematical foundations
Stirling Numbers       Antoniak distribution details
Approximations         A1 closed-form theory
Newton Algorithm       A2 exact moment matching
Weight Distributions   w₁, ρ, and dual-anchor
API Reference          Complete function documentation

Citation

If you use DPprior in your research, please cite:

@Manual{DPprior2026,
  title = {{DPprior}: Principled Prior Elicitation for {Dirichlet} Process Mixture Models},
  author = {JoonHo Lee},
  year = {2026},
  note = {R package version 1.1.0},
  url = {https://github.com/joonho112/DPprior},
}

@Article{Lee2025multisite,
  title = {Improving the Estimation of Site-Specific Effects and Their Distribution in Multisite Trials},
  author = {JoonHo Lee and Jonathan Che and Sophia Rabe-Hesketh and Avi Feller and Luke Miratrix},
  journal = {Journal of Educational and Behavioral Statistics},
  year = {2025},
  volume = {50},
  number = {5},
  pages = {731--764},
  doi = {10.3102/10769986241254286},
}

@Article{Lee2026dce,
  title = {Design-Conditional Prior Elicitation for {Dirichlet} Process Mixtures},
  author = {JoonHo Lee},
  journal = {arXiv preprint},
  year = {2026},
  eprint = {2602.06301},
  archiveprefix = {arXiv},
  url = {https://arxiv.org/abs/2602.06301},
}

This package builds on methodological foundations from:

  • Dorazio (2009): Original approach for K-based elicitation
  • Lee et al. (2025): Informative priors via χ² distribution on K
  • Vicentini & Jermyn (2025): Sample-size-independent approaches and weight-based elicitation
  • Zito et al. (2024): Stirling-gamma priors and negative binomial approximation

Support

This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D240078 to the University of Alabama.

https://ies.ed.gov/use-work/awards/improving-estimation-site-specific-effects-and-their-distribution-multisite-trials-practical-tools

The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

References

  • Dorazio, R. M. (2009). On selecting a prior for the precision parameter of Dirichlet process mixture models. Journal of Statistical Planning and Inference, 139(10), 3384–3390.

  • Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764.

  • Vicentini, C., & Jermyn, I. H. (2025). Prior selection for the precision parameter of Dirichlet process mixtures. arXiv:2502.00864.

  • Zito, A., Rigon, T., & Dunson, D. B. (2024). Bayesian nonparametric modeling of latent partitions via Stirling-gamma priors. arXiv:2306.02360.

License

MIT © JoonHo Lee