Package Architecture and Data Dictionary
Source:vignettes/reference-architecture.Rmd
reference-architecture.RmdS3 Class Hierarchy
ALprekDB uses S3 classes to represent data at each processing stage. Each class carries both the data and metadata (school year, format, processing timestamps).
Budget Pipeline
budget_read() --> alprek_budget_raw
budget_clean() --> alprek_budget_long
budget_transform() --> alprek_budget_master
budget_validate() --> alprek_budget_validation
budget_bind_years() --> alprek_budget_panel
Classroom Pipeline
classroom_read() --> alprek_classroom_raw
classroom_clean() --> alprek_classroom_clean
classroom_validate() --> alprek_classroom_validation
classroom_bind_years() --> alprek_classroom_panel
Student Pipeline
student_read() --> alprek_student_raw
student_clean() --> alprek_student_clean
student_validate() --> alprek_student_validation
student_bind_years() --> alprek_student_panel
student_transform() --> alprek_student_panel (enriched, same class)
S3 Object Structure
All panel objects share a common structure:
budget <- alprek_synthetic_budget(n_classrooms = 5, n_years = 2, seed = 1)
names(budget)
#> [1] "data" "years" "n_years" "by_year"
cat("$data: tibble with", nrow(budget$data), "rows x",
ncol(budget$data), "cols\n")
#> $data: tibble with 10 rows x 38 cols
cat("$years:", paste(budget$years, collapse = ", "), "\n")
#> $years: 2021-2022, 2022-2023
cat("$n_years:", budget$n_years, "\n")
#> $n_years: 2
cat("$by_year: list with per-year metadata\n")
#> $by_year: list with per-year metadataThe alprek_linkage_master has a special two-level
structure:
# master$classroom_level -- tibble (classroom-year level)
# master$student_level -- tibble (student-year level)
# master$diagnostics -- list of join diagnostics
# master$meta -- years, counts, timestampsFormat Detection
ADECE changed file formats in 2024-25. ALprekDB auto-detects the format:
| Module | Legacy (2021-2024) | New (2024-2025) | Detection Method |
|---|---|---|---|
| Budget | ~176 columns, “Additional Funds 1/2” | ~28 columns, “OSR/Other” | Column count + name pattern |
| Classroom | ~100 columns | ~125 columns (seat count, DOBs) | Column count + “Seat Count” presence |
| Student | 202 columns | 270 columns (child names, GOLD Growth) | Column count + “Child First Name” |
Format-specific differences are handled transparently by the cleaning functions.
Data Dictionary
Budget Panel Columns (53)
| Column Group | Example Columns | Type |
|---|---|---|
| Identifiers |
school_year, classroom_code,
classroom_name
|
factor/character |
| OSR Amounts |
osr_lead_teacher_salary,
osr_aux_teacher_benefits, … |
numeric |
| Other Amounts |
other_lead_teacher_salary,
other_equipment, … |
numeric |
| Totals |
osr_total, other_total,
grand_total
|
numeric |
| Shares |
share_osr, share_lt_salary,
share_teacher_compensation
|
numeric (0-1) |
| Delivery |
delivery_type, delivery_type_binary,
delivery_type_3way
|
factor |
Classroom Panel Columns (125)
| Column Group | Example Columns | Type |
|---|---|---|
| Identifiers |
classroom_code, school_year,
region, county_code
|
factor/character/integer |
| Delivery |
delivery_type (7-level factor) |
factor |
| Geography |
latitude, longitude,
senate_dist, house_dist
|
numeric/integer |
| Funding |
total_grant, enhancement_grant
|
numeric |
| Lead Teacher |
lead_tch_degree_level, lead_tch_race,
lead_tch_gender, lead_tch_total_exp
|
factor/integer |
| Aux Teacher |
aux_tch_degree_raw, aux_tch_race,
aux_tch_gender
|
factor/character |
| Staff Contacts |
monitor_name, monitor_email,
coach_name
|
character |
Student Panel Columns (288, with transform)
| Column Group | Count | Example Columns |
|---|---|---|
| Identifiers | 11 |
adece_id, classroom_code,
school_year
|
| Demographics | 8 |
gender, race, ethnicity,
dob, age
|
| Family/Income | 10 |
poverty_dum, english_learner,
single_parent, gross_income_midpoint
|
| Service Indicators | 15 |
childcare_subsidy, tanf, wic,
snap, iep, iep2
|
| Attendance | 6 |
days_absent_total, days_tardy_total
|
| GOLD (6 domains) | 60 |
gold_literacy_fall_raw,
gold_math_spring_scale,
gold_se_fall_kready
|
| GOLD Gains | 18 |
gold_literacy_gain_raw,
gold_math_kready_improved
|
| eDECA Pre/Post | 10+ |
edeca_initiative_pre_tscore,
edeca_tpf_gain
|
| Chronic Absence | 2 |
chronic_absence, chronic_absence_pct
|
| Service Density | 2 |
n_services, risk_index
|
Factor Level Reference
student <- alprek_synthetic_student(n_students = 10, n_classrooms = 5,
n_years = 1, seed = 1)
cat("gender:", paste(levels(student$data$gender), collapse = ", "), "\n")
#> gender: Male, Female
cat("race:", paste(levels(student$data$race), collapse = ", "), "\n")
#> race: White, Black, Latino/Hispanic, Asian, Mixed, Other, Unknown
cat("delivery_type:", paste(levels(student$data$delivery_type), collapse = ", "), "\n")
#> delivery_type: Public School, Private Child Care, Head Start, Community Organization, Faith-Based Organization, University Operated, Private School
cat("GOLD WHE:", paste(levels(student$data$gold_literacy_fall_whe), collapse = ", "), "\n")
#> GOLD WHE: Below, Meet, Exceed
cat("GOLD K-Ready:", paste(levels(student$data$gold_literacy_fall_kready), collapse = ", "), "\n")
#> GOLD K-Ready: Emerging, AccomplishedColumn Naming Conventions
ALprekDB follows consistent naming patterns:
| Pattern | Meaning | Example |
|---|---|---|
osr_* |
OSR (state) funding amount | osr_lead_teacher_salary |
other_* |
Other (non-state) funding | other_equipment |
share_* |
Proportion of grand total | share_osr |
gold_[domain]_[season]_* |
GOLD assessment | gold_literacy_fall_raw |
gold_[domain]_gain_* |
Fall-to-spring change | gold_literacy_gain_scale |
edeca_[construct]_* |
eDECA assessment | edeca_initiative_pre_tscore |
lead_tch_* |
Lead teacher attribute | lead_tch_degree_level |
aux_tch_* |
Auxiliary teacher attribute | aux_tch_race |
*_dum |
Binary dummy (0/1) |
poverty_dum, head_start_dum
|
Validation Check Reference
Budget Validation (7 checks)
-
required_columns- Core columns exist -
classroom_code_format- Valid classroom code format -
positive_totals- Grand totals > 0 -
share_bounds- Shares between 0 and 1 -
total_reconciliation- OSR + Other = Grand Total (within $1) -
delivery_type_valid- Known delivery types -
school_year_consistent- All rows same school year
Classroom Validation (10 checks)
-
required_columns- Core columns exist -
classroom_code_format- Valid format -
coordinate_range- Alabama lat/lon bounds -
delivery_type_valid- Known delivery types -
credential_valid- Degree levels recognized -
experience_range- Teaching years [0, 50] -
grant_positive- Grant amounts > 0 -
region_valid- Regions 1-11 -
year_first_funded_range- [1995, current year] -
school_year_consistent- Same school year
Student Validation (12 checks)
-
required_columns- Core columns exist -
adece_id_format- ADECE IDs not all NA -
student_unique- No duplicate ADECE IDs -
delivery_type_coverage- All 7 types represented -
gender_distribution- Male/Female 40-60% range -
missing_rates- Key variables below thresholds -
age_range- Pre-K age [3, 7] -
attendance_range- [0, 180] days -
income_parse_rate- Income parsed >= 70% -
gold_completeness- GOLD fall/spring >= 80% -
assessment_consistency- Scale scores non-negative -
school_year_consistent- Same school year
Linkage Validation (8 checks)
-
required_columns- Join keys exist -
key_uniqueness- No duplicate keys at classroom level -
match_rate- Join coverage >= 95% -
orphan_count- Unmatched rows reported -
na_introduced- Unexpected NAs from join -
year_coverage- All expected years present -
row_count_consistency- Row counts as expected -
region_consistency- Region values match across modules
Extending the Package
Adding a New Year’s Data
# 1. Process the new year
config_2526 <- student_config("2025-2026", "Student_25-26.xlsx")
result_2526 <- student_process(config_2526)
# 2. Add to existing DuckDB database
conn <- db_init("output/alprekdb.duckdb")
db_write_year(conn, result_2526$clean, table = "student_panel")
db_close(conn)Classroom Code Structure
Classroom codes encode organizational structure. Use
parse_classroom_codes() for analysis:
codes <- c("001P12345.01", "067H54321.02", "033C98765.03")
result <- parse_classroom_codes(codes)
result
#> # A tibble: 3 × 5
#> county_code delivery_type_code program_code class_num delivery_type
#> <chr> <chr> <chr> <chr> <chr>
#> 1 001 P 12345 01 Public School
#> 2 067 H 54321 02 Head Start
#> 3 033 C 98765 03 Private Child CareThe delivery type code maps to:
| Code | Delivery Type |
|---|---|
| P | Public School |
| C | Community Organization |
| H | Head Start |
| O | Faith-Based Organization |
| F | University Operated |
| U | Private School |
| S | Private Child Care |