Dataset Overview and Course Mapping • regdatasets

All 25 Datasets at a Glance

library(regdatasets)

# Build summary table
datasets <- c(
  "alcohol1_pp", "berk_sub", "berkeley", "civic_ed", "classdata_07",
  "crime", "disc", "disc2", "faculty", "gcse", "grades", "gss_1",
  "hsb_sub", "hsbs1", "individuals", "instruction", "lambert",
  "nels_data", "penalty", "pisa2000", "reading",
  "satisfaction", "titanic", "womenlf"
)

summary_df <- data.frame(
  Dataset = datasets,
  Rows = integer(length(datasets)),
  Cols = integer(length(datasets)),
  stringsAsFactors = FALSE
)

for (i in seq_along(datasets)) {
  data(list = datasets[i], package = "regdatasets")
  df <- get(datasets[i])
  summary_df$Rows[i] <- nrow(df)
  summary_df$Cols[i] <- ncol(df)
}

knitr::kable(summary_df, format = "html",
             caption = "Dimensions of all 25 datasets in regdatasets.",
             col.names = c("Dataset", "Observations", "Variables"))

Dimensions of all 25 datasets in regdatasets.
Dataset	Observations	Variables
alcohol1_pp	246	9
berk_sub	1169	3
berkeley	3593	3
civic_ed	794	237
classdata_07	44	12
crime	67	5
disc	2000	15
disc2	2000	15
faculty	514	10
gcse	4059	6
grades	198	8
gss_1	2832	16
hsb_sub	188	7
hsbs1	350	16
individuals	55899	6
instruction	1190	12
lambert	492	109
nels_data	2000	299
penalty	674	3
pisa2000	4528	15
reading	66	8
satisfaction	110	3
titanic	1309	6
womenlf	263	5

Course Chapter Mapping

The table below shows which datasets are used in each chapter of the companion textbook.

Chapter	Topic	Primary Datasets
1	Simple Linear Regression	`gcse`, `classdata_07`, `pisa2000`
2	ANCOVA	`gcse`, `instruction`
3	One-way ANOVA	`reading`, `instruction`
4	Continuous Predictors	`crime`
5	Interactions	`gcse`, `individuals`, `faculty`, `crime`
6	Nonlinear Relationships	`nels_data`, `crime`
7	Model Building	`nels_data`, `hsb_sub`, `civic_ed`
8	Model Diagnostics	`hsbs1`, `nels_data`
9	Simple Logistic Regression	`gss_1`, `disc`, `titanic`
10	Multiple Logistic Regression	`berkeley`, `berk_sub`, `disc2`, `titanic`
11	Logistic Model Fit	`penalty`, `disc`
12	Latent Response & GLM	`lambert`, `grades`
13	Ordinal Models	`womenlf`, `satisfaction`, `alcohol1_pp`

Datasets by Topic Area

Linear Regression Foundation

These datasets support the core linear regression chapters (1-8):

gcse (4,059 x 9): The primary running example. GCSE exam scores and London Reading Test scores from 65 schools. Used across Chapters 1, 2, and 5 for SLR, ANCOVA, and interactions.
crime (67 x 5): Florida county-level crime rates. Demonstrates the confounding/sign-reversal phenomenon between education and crime once urbanization is controlled.
instruction (1,190 x 12): Math achievement gains nested in classrooms and schools. Supports ANOVA and ANCOVA demonstrations.
reading (66 x 8): Three-group reading instruction experiment. Clean example for one-way ANOVA with clear group differences.

Logistic Regression

These datasets support the binary outcome chapters (9-11):

berkeley (3,593 x 3): The classic Simpson’s paradox dataset. UC Berkeley graduate admissions by gender and department.
gss_1 (2,832 x 16): General Social Survey 1982/1994. Attitude change over time as a simple logistic regression example.
penalty (674 x 3): Death penalty sentencing data by defendant and victim race. Used for goodness-of-fit testing.
titanic (1,309 x 6): Passenger survival data with class, sex, age, and fare as predictors.

Ordinal and Advanced Models

womenlf (263 x 5): Canadian women’s labor force participation with a three-level ordinal outcome.
satisfaction (110 x 3): Four-level ordinal satisfaction data for ordinal logistic regression.
lambert (492 x 109): Rich longitudinal dataset for latent response model demonstrations.

Data Sources

Most datasets originate from published educational and social science studies. Key sources include:

GCSE data: Goldstein et al. (1993), UK examination council records
PISA 2000: OECD Programme for International Student Assessment
Berkeley admissions: Bickel, Hammel & O’Connell (1975)
General Social Survey: NORC at the University of Chicago
NAEP: National Center for Education Statistics
NELS:88: National Education Longitudinal Study of 1988
Titanic: British Board of Trade records

See individual dataset documentation (?dataset_name) for complete source citations.