Canonical hash for cross-machine reproducibility checks
Source: R/utils-reproducibility.R
canonical_hash.Rd
Compute a stable content hash of a multisiteDGP simulation object, the hash that identifies whether two simulation runs produced bit-identical results. The hash is canonical: it normalizes column order, drops row names, selects only the stable diagnostics, and replaces callback functions with presence sentinels (so the hash is invariant under callback identity but sensitive to callback presence).
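The normalization steps described above can be sketched in a few lines of R. This is a minimal illustration and not the package's implementation: `sketch_hash()` and its sentinel strings are hypothetical names, the input is assumed to be a plain data frame with at most one optional callback, and the diagnostics allowlist is left out entirely. It assumes the digest package is installed.

```r
library(digest)

# Hypothetical helper: illustrates the idea only, not multisiteDGP's code.
sketch_hash <- function(df, callback = NULL, algo = "xxhash64") {
  # as.list() on a data frame drops the class and the row names.
  cols <- as.list(df)
  # Radix sort uses C-locale ordering, so the column order does not depend
  # on the machine's collation settings.
  cols <- cols[sort(names(cols), method = "radix")]
  # Record only whether a callback exists, never the function itself.
  sentinel <- if (is.function(callback)) "<callback:present>" else "<callback:absent>"
  digest(list(columns = cols, callback = sentinel), algo = algo)
}

a <- data.frame(tau = c(0.1, 0.2), site = c(1, 2))
b <- data.frame(site = c(1, 2), tau = c(0.1, 0.2), row.names = c("s1", "s2"))

identical(sketch_hash(a), sketch_hash(b))                   # column order and row names are ignored
identical(sketch_hash(a, mean), sketch_hash(a, median))     # callback identity is ignored
identical(sketch_hash(a), sketch_hash(a, callback = mean))  # callback presence changes the hash
```

Hashing a sentinel string instead of the function avoids serializing closures, whose environments and source references can differ between sessions even when the code is the same.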
Usage
canonical_hash(
  x,
  algo = "xxhash64",
  columns_to_include = NULL,
  diagnostics_to_include = NULL
)
Arguments
- x
A multisitedgp_data, multisitedgp_design, data frame, or other R object.
- algo
Character. Hash algorithm passed to digest. Default "xxhash64": fast, 16-hex output, suitable for typical reproducibility checks.
- columns_to_include
Optional character vector of columns to include for data-frame-like objects. Columns are sorted before hashing. Default NULL (all canonical columns).
- diagnostics_to_include
Optional character vector of diagnostic names to include. Default NULL (the blueprint's numeric-diagnostics allowlist).
Details
Cross-OS policy. Linux x86_64 / amd64 is the strict hash
baseline for golden fixtures used in the package's regression tests.
macOS and Windows are held to same-machine reproducibility and
distributional parity rather than byte-identical agreement with the Linux
hashes: minor floating-point divergences across platforms are expected and do not
indicate a bug. See system.file("REPRODUCIBILITY.md", package = "multisiteDGP")
for the full installed policy.
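One way to apply this policy in a test suite is to gate the strict hash comparison on the platform. The sketch below is an assumption about how such a gate could look, not the package's own test code: `is_strict_baseline()` and `check_against_golden()` are hypothetical names, and a tolerance-based comparison of summary statistics stands in for whatever "distributional parity" check the installed policy actually prescribes.

```r
# Hypothetical helper: TRUE only on the strict baseline, Linux x86_64 / amd64.
is_strict_baseline <- function(sysname = Sys.info()[["sysname"]],
                               machine = Sys.info()[["machine"]]) {
  identical(sysname, "Linux") && machine %in% c("x86_64", "amd64")
}

# Hypothetical check: byte-identical hashes on the baseline platform,
# otherwise a tolerance comparison of summary statistics (an assumed
# stand-in for a distributional parity check).
check_against_golden <- function(observed_hash, golden_hash,
                                 observed_stats, golden_stats,
                                 tolerance = 1e-8,
                                 strict = is_strict_baseline()) {
  if (strict) {
    return(identical(observed_hash, golden_hash))
  }
  isTRUE(all.equal(observed_stats, golden_stats, tolerance = tolerance))
}

# A tiny floating-point divergence fails the strict check but passes parity.
check_against_golden("aaaa", "bbbb", c(1, 2), c(1, 2 + 1e-12), strict = TRUE)
check_against_golden("aaaa", "bbbb", c(1, 2), c(1, 2 + 1e-12), strict = FALSE)
```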
Use cases. (1) Save the hash alongside a published simulation result so future readers can verify reproduction. (2) Pin a regression test fixture so unintended pipeline changes are caught. (3) Detect whether two parallel workers produced the same output.
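The three use cases reduce to storing a hash string and comparing it later. The sketch below uses only base R and treats hashes as plain character strings such as those returned by canonical_hash(). The helper names (`write_hash_sidecar()`, `matches_sidecar()`, `all_workers_agree()`) and the sidecar-file layout are hypothetical, chosen for illustration.

```r
# Hypothetical helpers; a hash is a plain string as returned by canonical_hash().
write_hash_sidecar <- function(hash, path) {
  writeLines(hash, path)
  invisible(path)
}

matches_sidecar <- function(hash, path) {
  identical(hash, readLines(path, n = 1L))
}

all_workers_agree <- function(hashes) {
  length(unique(unlist(hashes))) == 1L
}

# The hash string is the one shown in the Examples section.
published <- "f367529f6b9347bf"
sidecar <- tempfile(fileext = ".hash")
write_hash_sidecar(published, sidecar)

matches_sidecar("f367529f6b9347bf", sidecar)  # a reproduced run with the same hash
matches_sidecar("0000000000000000", sidecar)  # a run whose output changed
all_workers_agree(list("f367529f6b9347bf", "f367529f6b9347bf"))
```

In a regression test the sidecar file would live next to the fixture under version control, so a changed hash shows up as a failing comparison rather than a silent drift.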
For a worked reproducibility walkthrough see the Reproducibility and provenance vignette.
References
Lee, J., Che, J., Rabe-Hesketh, S., Feller, A., & Miratrix, L. (2025). Improving the estimation of site-specific effects and their distribution in multisite trials. Journal of Educational and Behavioral Statistics, 50(5), 731–764. doi:10.3102/10769986241254286.
See also
provenance_string for the human-readable one-line provenance summary; the M7 Reproducibility and provenance vignette.
Other family-reproducibility:
provenance_string()
Examples
dat <- sim_multisite(J = 10L, seed = 1L)
canonical_hash(dat)
#> [1] "f367529f6b9347bf"
# Same design / seed → same hash.
identical(canonical_hash(dat), canonical_hash(sim_multisite(J = 10L, seed = 1L)))
#> [1] TRUE