A subset of UC Berkeley's 1973 graduate admissions data, restricted to two departments (Engineering and Psychology) and 1,169 applicants. This streamlined version of the Berkeley data illustrates Simpson's paradox and confounding in multiple logistic regression: the apparent association between gender and admission largely disappears after controlling for department.

Usage

berk_sub

Format

A tibble with 1,169 rows and 3 columns:

admitted

Admission decision. Type: integer. Binary indicator (0/1) where 1 = admitted, 0 = not admitted. This is the primary outcome variable. The overall admission rate is approximately 44%.

male

Sex of applicant. Type: integer. Binary indicator (0/1) where 1 = male, 0 = female. Approximately 64% of applicants are male.

engineering

Department applied to. Type: integer. Binary indicator (0/1) where 1 = Engineering, 0 = Psychology. The two departments differ substantially in both gender composition and admission rates.

Source

Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398-404. Original data file: berk_sub.dta

Details

This dataset is used in Chapter 10 (Multiple Logistic Regression) to illustrate confounding and omitted variable bias in logistic regression. Key analyses include: computing unadjusted odds ratios for admission by gender, fitting a multiple logistic regression controlling for department, and demonstrating how the apparent gender effect (unadjusted OR = 3.25) largely disappears after adjusting for department (adjusted OR approximately 1.1).
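
As a minimal sketch of the odds-ratio step, the coefficients for male printed in the Examples section can be exponentiated directly. The values below are copied by hand from that output, and the names `b_unadj`, `b_adj`, and `or` are illustrative, not part of the package.

```r
# Coefficients for `male`, copied from the glm() output in Examples
b_unadj <- 1.180   # from admitted ~ male
b_adj   <- 0.1241  # from admitted ~ male + engineering

# A logistic regression coefficient is a log odds ratio,
# so exponentiating it gives the odds ratio
or <- exp(c(unadjusted = b_unadj, adjusted = b_adj))

# Approximately 3.25 (unadjusted) and 1.13 (adjusted)
print(round(or, 2))
```

With a fitted model object stored as, say, `fit`, `exp(coef(fit))` should give the same quantities without copying numbers by hand.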

The Engineering department has a much higher proportion of male applicants (approximately 96%) than Psychology (approximately 33%), and a higher overall admission rate (approximately 63% versus approximately 25%). This creates the conditions for confounding and Simpson's paradox: the omitted variable (department) is correlated with the predictor (gender) and is itself a predictor of the outcome (admission).
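
One rough way to see this in the adjusted model is to convert its coefficients into predicted admission probabilities for each combination of gender and department. This is only a sketch: the coefficients are copied from the adjusted glm() output in Examples, and `b` and `cells` are illustrative names.

```r
# Coefficients copied from the adjusted model output in Examples
b <- c(intercept = -1.1309, male = 0.1241, engineering = 1.5551)

# All four combinations of gender and department
cells <- expand.grid(male = 0:1, engineering = 0:1)

# The inverse logit of the linear predictor is the predicted probability
cells$p_admit <- plogis(
  b[["intercept"]] +
    b[["male"]] * cells$male +
    b[["engineering"]] * cells$engineering
)

print(round(cells, 2))
```

Within each department the predicted probabilities for men and women differ by only a few percentage points, while the gap between departments is large.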

Examples

data(berk_sub)
head(berk_sub)
#> # A tibble: 6 × 3
#>   admitted  male engineering
#>      <int> <int>       <int>
#> 1        0     1           1
#> 2        0     0           1
#> 3        0     1           0
#> 4        0     0           0
#> 5        1     1           1
#> 6        1     0           1

# Unadjusted logistic regression: gender effect on admission
glm(admitted ~ male, data = berk_sub, family = binomial)
#> 
#> Call:  glm(formula = admitted ~ male, family = binomial, data = berk_sub)
#> 
#> Coefficients:
#> (Intercept)         male  
#>      -1.017        1.180  
#> 
#> Degrees of Freedom: 1168 Total (i.e. Null);  1167 Residual
#> Null Deviance:	    1605 
#> Residual Deviance: 1520 	AIC: 1524

# Adjusted logistic regression: controlling for department
glm(admitted ~ male + engineering, data = berk_sub, family = binomial)
#> 
#> Call:  glm(formula = admitted ~ male + engineering, family = binomial, 
#>     data = berk_sub)
#> 
#> Coefficients:
#> (Intercept)         male  engineering  
#>     -1.1309       0.1241       1.5551  
#> 
#> Degrees of Freedom: 1168 Total (i.e. Null);  1166 Residual
#> Null Deviance:	    1605 
#> Residual Deviance: 1428 	AIC: 1434