Working with Latent Distributions
Source:vignettes/latent_distributions.Rmd
latent_distributions.RmdOverview
The sim_latentG() function generates latent abilities
(person parameters
)
for IRT simulation studies. It implements the population model:
where is a flexible distribution family that can take many different shapes.
This vignette covers:
- The pre-standardization principle
- Available distribution shapes
- Customizing shape parameters
- Creating custom mixture distributions
- Adding covariate effects
- Visualization tools
The Pre-Standardization Principle
A key design feature of sim_latentG() is
pre-standardization: every built-in distribution shape
is mathematically constructed to have mean 0 and variance
1 before any scaling is applied.
This ensures that:
- Changing the shape does not inadvertently change the scale
- The
sigmaparameter directly controls the standard deviation - Comparisons across shapes are meaningful
The generated abilities follow:
where with and .
Why This Matters
In traditional simulation approaches, changing the latent distribution often changes both the shape and the scale simultaneously. For example, switching from to changes not just the shape but also the variance.
With pre-standardization, you can study the effect of distributional shape on IRT estimation while holding variance constant—a cleaner experimental design.
Basic Usage
# Generate 1000 standard normal abilities
sim_normal <- sim_latentG(n = 1000, shape = "normal", seed = 42)
# Examine the result
print(sim_normal)
#> Latent Ability Distribution (G-family)
#> =======================================
#> Shape : normal
#> n : 1000
#> Target mu : 0.000
#> Target sigma: 1.000
#>
#> Sample Moments:
#> Mean : -0.0258
#> SD : 1.0025
#> Skewness : -0.0038
#> Kurtosis : 0.1286 (excess)The output shows:
- Target mu/sigma: The requested location and scale
- Sample Moments: Empirical mean, SD, skewness, and excess kurtosis
Available Distribution Shapes
sim_latentG() provides 12 built-in shapes, each
pre-standardized to mean 0 and variance 1.
Standard Normal
The baseline case:
sim_normal <- sim_latentG(n = 2000, shape = "normal", seed = 1)
plot(sim_normal, show_normal = TRUE)
Bimodal Distribution
A symmetric two-component Gaussian mixture, useful for representing populations with two distinct subgroups (e.g., native vs. non-native speakers).
Mathematical construction:
The component variance ensures .
sim_bimodal <- sim_latentG(n = 2000, shape = "bimodal", seed = 1)
plot(sim_bimodal, show_normal = TRUE)
Customizing mode separation:
The delta parameter controls how far apart the modes are
(0 < delta < 1):
# Wider separation
sim_bimodal_wide <- sim_latentG(
n = 2000,
shape = "bimodal",
shape_params = list(delta = 0.95),
seed = 1
)
plot(sim_bimodal_wide, show_normal = TRUE)
Trimodal Distribution
A symmetric three-component mixture with a central peak and two side peaks.
Mathematical construction:
Components at with weights where .
Component variance:
sim_trimodal <- sim_latentG(n = 2000, shape = "trimodal", seed = 1)
plot(sim_trimodal, show_normal = TRUE)
Customizing:
# Stronger central peak
sim_trimodal_central <- sim_latentG(
n = 2000,
shape = "trimodal",
shape_params = list(
w0 = 0.5, # Weight of central component (default: 1/3)
m = 1.3 # Location of side components (default: 1.2)
),
seed = 1
)
plot(sim_trimodal_central, show_normal = TRUE)
Multimodal (Four Components)
A symmetric four-component mixture with modes at .
sim_multi <- sim_latentG(n = 2000, shape = "multimodal", seed = 1)
plot(sim_multi, show_normal = TRUE)
Skewed Distributions
Right-skewed (skew_pos): Based on standardized Gamma distribution:
This has and for any .
sim_skew_pos <- sim_latentG(n = 2000, shape = "skew_pos", seed = 1)
plot(sim_skew_pos, show_normal = TRUE)
Left-skewed (skew_neg): Simply the negation of the right-skewed distribution.
sim_skew_neg <- sim_latentG(n = 2000, shape = "skew_neg", seed = 1)
plot(sim_skew_neg, show_normal = TRUE)
Controlling skewness magnitude:
The k parameter (Gamma shape) controls skewness—smaller
values = more skewed:
# More extreme skewness (k = 2)
sim_very_skew <- sim_latentG(
n = 2000,
shape = "skew_pos",
shape_params = list(k = 2),
seed = 1
)
cat(sprintf("Default k=4 skewness: %.3f\n", sim_skew_pos$sample_moments$skewness))
#> Default k=4 skewness: 0.812
cat(sprintf("k=2 skewness: %.3f\n", sim_very_skew$sample_moments$skewness))
#> k=2 skewness: 1.244Heavy-Tailed Distribution
Based on standardized Student-t:
This has for .
sim_heavy <- sim_latentG(n = 2000, shape = "heavy_tail", seed = 1)
plot(sim_heavy, show_normal = TRUE)
Controlling tail heaviness:
The df parameter (degrees of freedom) controls tail
weight—smaller values = heavier tails:
# Very heavy tails (df = 3)
sim_very_heavy <- sim_latentG(
n = 2000,
shape = "heavy_tail",
shape_params = list(df = 3),
seed = 1
)
cat(sprintf("Default df=5 kurtosis: %.3f\n", sim_heavy$sample_moments$kurtosis))
#> Default df=5 kurtosis: 2.924
cat(sprintf("df=3 kurtosis: %.3f\n", sim_very_heavy$sample_moments$kurtosis))
#> df=3 kurtosis: 7.956Light-Tailed (Platykurtic) Distribution
A mixture distribution approximating a platykurtic shape (negative excess kurtosis).
sim_light <- sim_latentG(n = 2000, shape = "light_tail", seed = 1)
plot(sim_light, show_normal = TRUE)
Uniform Distribution
Uniform on , which has mean 0 and variance 1.
sim_uniform <- sim_latentG(n = 2000, shape = "uniform", seed = 1)
plot(sim_uniform, show_normal = TRUE)
Floor and Ceiling Effects
These represent situations where there’s a concentration of examinees at one end of the ability distribution.
Floor effect: Heavy component near the lower bound (e.g., many low-ability students in a difficult test)
sim_floor <- sim_latentG(n = 2000, shape = "floor", seed = 1)
plot(sim_floor, show_normal = TRUE)
Ceiling effect: Heavy component near the upper bound (e.g., many high-ability students in an easy test)
sim_ceiling <- sim_latentG(n = 2000, shape = "ceiling", seed = 1)
plot(sim_ceiling, show_normal = TRUE)
Comparing Multiple Shapes
The compare_shapes() function provides a convenient way
to visualize multiple distributions side-by-side:
compare_shapes(
n = 3000,
shapes = c("normal", "bimodal", "trimodal",
"skew_pos", "heavy_tail", "uniform"),
sigma = 1,
seed = 42
)
Custom Mixture Distributions
For maximum flexibility, use shape = "custom" with
mixture_spec:
# Define a custom 3-component mixture
sim_custom <- sim_latentG(
n = 2000,
shape = "custom",
mixture_spec = list(
weights = c(0.3, 0.5, 0.2), # Must sum to 1
means = c(-1.5, 0, 2), # Component means
sds = c(0.5, 0.7, 0.5) # Component SDs
),
seed = 1
)
plot(sim_custom, show_normal = TRUE)
By default, custom mixtures are automatically post-standardized to have mean 0 and variance 1. To disable this:
# Keep raw mixture parameters
sim_raw <- sim_latentG(
n = 2000,
shape = "custom",
mixture_spec = list(
weights = c(0.5, 0.5),
means = c(-1, 1),
sds = c(0.5, 0.5)
),
standardize_custom = FALSE,
seed = 1
)Adjusting Location and Scale
The mu and sigma parameters allow you to
shift and scale the distribution:
# Generate abilities with mean 100 and SD 15 (like IQ scores)
sim_iq <- sim_latentG(
n = 1000,
shape = "normal",
mu = 100,
sigma = 15,
seed = 42
)
summary(sim_iq)
#> Summary: Latent Ability Distribution
#> ====================================
#> Shape : normal
#> n : 1000
#> Target : mu = 100.00, sigma = 15.00
#> Covariates : No
#>
#> Sample Statistics:
#> Mean : 99.6126
#> SD : 15.0378
#> Median : 99.8030
#> Skewness : -0.0038
#> Kurtosis : 0.1286 (excess)
#> Range : [49.4239, 152.4296]
#>
#> Quantiles:
#> 2.5% 5% 25% 50% 75% 95% 97.5%
#> 69.8816 75.3193 89.8681 99.8030 109.9601 123.0023 128.1655This works with any shape:
# Bimodal with different scale
sim_bimodal_scaled <- sim_latentG(
n = 1000,
shape = "bimodal",
mu = 0,
sigma = 1.5, # Larger spread
seed = 42
)
cat(sprintf("Sample SD: %.3f (target: 1.5)\n", sd(sim_bimodal_scaled$theta)))
#> Sample SD: 1.481 (target: 1.5)Adding Covariate Effects
You can incorporate person-level covariates that affect ability:
# Create covariate data
n <- 1000
set.seed(42)
group <- rbinom(n, 1, 0.5) # Binary group indicator
ses <- rnorm(n) # Continuous SES measure
# Generate abilities with covariate effects
sim_cov <- sim_latentG(
n = n,
shape = "normal",
xcov = data.frame(group = group, ses = ses),
beta = c(0.5, 0.3), # Group effect = 0.5, SES effect = 0.3
seed = 42
)
# Verify covariate effects
cat("Mean ability by group:\n")
#> Mean ability by group:
cat(sprintf(" Group 0: %.3f\n", mean(sim_cov$theta[group == 0])))
#> Group 0: -0.084
cat(sprintf(" Group 1: %.3f\n", mean(sim_cov$theta[group == 1])))
#> Group 1: 0.519
cat(sprintf(" Difference: %.3f (expected: 0.5)\n",
mean(sim_cov$theta[group == 1]) - mean(sim_cov$theta[group == 0])))
#> Difference: 0.602 (expected: 0.5)The full model is:
where is the covariate vector for person .
Working with the Output Object
The sim_latentG() function returns a
latent_G object containing:
sim <- sim_latentG(n = 100, shape = "bimodal", seed = 1)
# Available components
names(sim)
#> [1] "theta" "mu" "sigma" "eta_cov"
#> [5] "shape" "shape_params" "n" "sample_moments"
#> [9] "z"
# The theta vector
head(sim$theta)
#> [1] -0.5611365 0.4327842 -0.5953282 -1.4776179 1.6598142 0.3882399
# The standardized z values (before scaling)
head(sim$z)
#> [1] -0.5611365 0.4327842 -0.5953282 -1.4776179 1.6598142 0.3882399
# Sample moments
sim$sample_moments
#> $mean
#> [1] 0.005452304
#>
#> $sd
#> [1] 0.9591602
#>
#> $skewness
#> [1] -0.1115873
#>
#> $kurtosis
#> [1] -1.109185Extracting theta for Other Uses
# Get theta as a numeric vector
theta_vec <- sim$theta
# Use in your own analysis
mean(theta_vec)
#> [1] 0.005452304Connection to IRT Framework
In the Rasch/2PL model, the latent distribution affects key quantities:
Expected Test Information
Different latent shapes produce different expected information profiles, even with identical item parameters.
Identifiability
For model identification in the Rasch model, we typically fix either:
- (location constraint), or
- (item constraint)
The sim_latentG() function generates abilities with mean
0 by default, supporting the first identification approach.
Using sim_latentG with eqc_calibrate
When using sim_latentG() as part of reliability-targeted
simulation, specify the same parameters in
eqc_calibrate():
# Calibrate for a bimodal population
eqc_result <- eqc_calibrate(
target_rho = 0.80,
n_items = 25,
model = "rasch",
latent_shape = "bimodal",
latent_params = list(delta = 0.8),
seed = 42
)
# Generate response data with the same distribution
sim_data <- simulate_response_data(
eqc_result = eqc_result,
n_persons = 1000,
latent_shape = "bimodal",
latent_params = list(delta = 0.8),
seed = 123
)Summary of Shape Parameters
| Shape | Parameter | Default | Range | Description |
|---|---|---|---|---|
bimodal |
delta |
0.8 | (0, 1) | Mode separation |
trimodal |
w0 |
1/3 | (0, 1) | Weight of central component |
m |
1.2 | > 0 | Location of side modes | |
multimodal |
m1 |
0.5 | > 0 | Inner mode locations |
m2 |
1.3 | > 0 | Outer mode locations | |
w_inner |
0.30 | (0, 0.5) | Weight of inner components | |
skew_pos/neg |
k |
4 | > 0 | Gamma shape (smaller = more skewed) |
heavy_tail |
df |
5 | > 2 | Degrees of freedom (smaller = heavier) |
floor |
w_floor |
0.3 | (0, 1) | Weight at floor |
m_floor |
-1.5 | < 0 | Floor location | |
ceiling |
w_ceil |
0.3 | (0, 1) | Weight at ceiling |
m_ceil |
1.5 | > 0 | Ceiling location |
Practical Recommendations
Choosing a Shape
| Research Question | Recommended Shape |
|---|---|
| Standard simulation | normal |
| Heterogeneous population | bimodal |
| Mixed ability levels |
trimodal or multimodal
|
| Selective samples |
skew_pos or skew_neg
|
| Robust estimation | heavy_tail |
| Easy/difficult tests |
ceiling / floor
|
| Sensitivity analysis | Compare multiple shapes |
Sample Size Considerations
For stable Monte Carlo estimates:
-
M = 10,000 or more for
eqc_calibrate()quadrature - n = 500–2000 per replication for simulation studies
- Increase n for heavy-tailed or highly multimodal shapes
Reproducibility
Always set a seed for reproducible results:
sim1 <- sim_latentG(n = 100, shape = "normal", seed = 42)
sim2 <- sim_latentG(n = 100, shape = "normal", seed = 42)
identical(sim1$theta, sim2$theta) # TRUE
#> [1] TRUEFurther Reading
For theoretical background on latent distributions in IRT, see:
Baker, F. B., & Kim, S.-H. (2004). Item Response Theory: Parameter Estimation Techniques (2nd ed.). Marcel Dekker.
Paganin, S., et al. (2023). Computational strategies and estimation performance with Bayesian semiparametric item response theory models. Journal of Educational and Behavioral Statistics, 48(2), 147-188.
For information on the pre-standardization mathematics, see Appendix F of the IRTsimrel paper.