Skip to contents

Data from the High School and Beyond (HS&B) longitudinal study, a nationally representative survey of U.S. high school students conducted by the National Center for Education Statistics (NCES). This full sample of 350 students includes academic achievement scores, socioeconomic status, and school program information. Used in regression teaching to illustrate model diagnostics, multicollinearity among correlated achievement measures, and multiple regression with both continuous and categorical predictors.

Usage

hsbs1

Format

A tibble with 350 rows and 16 columns:

id

Student identifier. Type: numeric. Range: (1, 599).

gender

Student gender. Type: numeric. Values: 1 = male, 2 = female.

race

Student race/ethnicity. Type: numeric. Values: 1 = Hispanic, 2 = Asian, 3 = African American, 4 = White.

ses

Socioeconomic status. Type: numeric. Values: 1 = low, 2 = middle, 3 = high.

sch

School type. Type: numeric. Values: 1 = public, 2 = private.

prog

Academic program. Type: numeric. Values: 1 = general, 2 = academic, 3 = vocational.

locus

Locus of control composite score. Type: numeric. Range: (-2.23, 1.36). Higher values indicate more internal locus of control.

concept

Self-concept composite score. Type: numeric. Range: (-2.62, 1.19). Higher values indicate more positive self-concept.

mot

Motivation. Type: numeric. Range: (0, 1). Proportion-based measure of academic motivation.

career

Career choice code. Type: numeric. Range: (1, 17). Categorical code for intended career.

read

Reading achievement test score (standardized). Type: numeric. Range: (28.3, 73.3). Mean approximately 52.

write

Writing achievement test score (standardized). Type: numeric. Range: (28.1, 67.1). Mean approximately 52.

math

Mathematics achievement test score (standardized). Type: numeric. Range: (31.8, 75.5). Mean approximately 51.

sci

Science achievement test score (standardized). Type: numeric. Range: (26.0, 74.2). Mean approximately 52.

ss

Social studies achievement test score (standardized). Type: numeric. Range: (25.7, 70.5). Mean approximately 52.

sample

Sample weight indicator. Type: numeric. Constant value of 1 for all observations in this subsample.

Source

National Center for Education Statistics (NCES). High School and Beyond (HS&B) Longitudinal Study. U.S. Department of Education. Original data file: hsbs1.dta

Details

This dataset is used in Chapter 8 (Model Diagnostics) to illustrate regression diagnostic techniques including residual analysis, leverage and influence measures, and detection of outliers. It also serves as a demonstration dataset for model building with multiple correlated predictors (the five achievement test scores) and for examining the effects of categorical predictors such as program type and SES on academic outcomes. Key analyses include: multiple regression of writing scores on reading and math, residual plots, Cook's distance, leverage diagnostics, and VIF assessment for multicollinearity.

Examples

data(hsbs1)
head(hsbs1)
#> # A tibble: 6 × 16
#>      id gender  race   ses   sch  prog  locus concept   mot career  read write
#>   <int>  <int> <int> <int> <int> <int>  <dbl>   <dbl> <dbl>  <int> <dbl> <dbl>
#> 1   303      2     4     1     1     1 -0.840 -0.240  1          4  54.8  64.5
#> 2   404      2     4     2     1     1 -0.380 -0.470  0.670      4  62.7  43.7
#> 3   225      1     4     2     1     2  0.890  0.590  0.670      6  60.6  56.7
#> 4   553      1     4     2     2     2  0.710  0.280  0.670     16  62.7  56.7
#> 5   433      2     4     2     1     3 -0.640  0.0300 1          4  41.6  46.3
#> 6   189      2     4     3     1     2  1.11   0.900  0.330      6  62.7  64.5
#> # ℹ 4 more variables: math <dbl>, sci <dbl>, ss <dbl>, sample <int>
# Multiple regression of writing on reading and math
fit <- lm(write ~ read + math, data = hsbs1)
summary(fit)
#> 
#> Call:
#> lm(formula = write ~ read + math, data = hsbs1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -18.1588  -4.8532   0.1208   4.9720  17.4686 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 13.87143    2.29829   6.036 4.07e-09 ***
#> read         0.35117    0.05323   6.597 1.57e-10 ***
#> math         0.39387    0.05852   6.730 7.01e-11 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 7.205 on 347 degrees of freedom
#> Multiple R-squared:  0.4539,	Adjusted R-squared:  0.4507 
#> F-statistic: 144.2 on 2 and 347 DF,  p-value: < 2.2e-16
#>