National and state-level datasets of United States forensic DNA databases 2001–2025

Yemko Pryor, Joao Pedro Donadio, Samantha C. Muller, Jenna Wilson, Tina Lasisi

TL;DR

The paper addresses the lack of harmonized, longitudinal data on U.S. forensic DNA databases by constructing three integrated datasets: a national NDIS time series (2001–2025) from archived FBI pages, a state-level SDIS dataset with arrestee counts and policy metadata, and FOIA-derived demographic and annual collection data. The method has three parts: reconstructing federal statistics via the Wayback Machine, compiling state policies and counts, and digitizing the Murphy & Tong appendices. The data are validated through anomaly detection and external calibration. The datasets support longitudinal and cross-jurisdictional analyses of database growth, governance, and reporting practices, including assessment of policy impact, inter-state differences, and the historical evolution of CODIS infrastructure. Versioned data and reproducible code are publicly available on Zenodo and GitHub for reuse in research and policy applications.

Abstract

Forensic DNA databases in the United States have expanded substantially over the past two decades. However, comprehensive, harmonized data describing database structure and composition remain limited. This dataset series documents forensic DNA infrastructure across national and state levels from 2001 to 2025. It includes a reconstructed time series of monthly National DNA Index System (NDIS) statistics from FBI archives, capturing counts of offender, arrestee, and forensic profiles, participating laboratory totals, and investigations aided. A complementary dataset compiles publicly available state-level statistics and policy metadata on arrestee collection laws, familial search practices, and DNA collection statutes across all 50 states. A third dataset provides standardized demographic and annual collection data obtained through previously published public records requests, including racial and gender composition where reported. Together, these resources provide a foundation for studying the historical development of forensic DNA systems in the U.S., enabling longitudinal and cross-sectional analyses of database growth, policy variation, and reporting practices across jurisdictions.

Paper Structure

This paper contains 13 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: FBI formatting changes in NDIS statistics pages. Variations in the structure and organization of FBI NDIS webpages captured through FBI.gov in different eras. Webpages are grouped into five eras used to build the NDIS time series dataset. The eras include a pre-2007 era with jurisdiction-specific HTML pages, unified HTML pages with jurisdiction-based dividers in 2008-2010, modified section headers in 2011-2016, and more modern unified pages in the post-2017 eras.
  • Figure 2: Parsing logic for NDIS statistics scraping. A diagram depicting the step-by-step approach used to scrape the Wayback Machine and consolidate FBI NDIS statistics using era-specific parsers, producing a validated, cleaned dataset of compiled NDIS time series data.
  • Figure 3: SDIS policies and FOIA availability mapped by state. (A) Arrestee DNA collection policy availability, coded as yes (blue) or no (gray); (B) familial search policy, coded as permitted (green), prohibited (black), or unspecified (blue); and (C) FOIA response status, coded as not_provided (red) or provided (blue).
  • Figure 4: File and folder structure. A file map showcasing the hierarchical organization of directories and data sources present in the public GitHub repository: https://github.com/lasisilab/PODFRIDGE-Databases
  • Figure 5: Anomaly distribution per state. A stacked bar plot detailing anomalies in Offender Profiles (magenta), Forensic Profiles (pink), Arrestee Profiles (cyan), Investigations Aided (turquoise) and NDIS Labs (purple), cataloged by state and corrected across all metrics and jurisdictions. Wyoming's high values can be explained by flagged value propagation in the early years, which may be due to periods where no updates were made.
  • ...and 2 more figures
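
The scraping approach summarized in Figures 1 and 2 (list archived captures, then route each capture to an era-specific parser) can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes the Wayback Machine CDX search endpoint is used to list captures; the page URL `example.org/ndis` is a placeholder, the era names are invented for illustration, and the cutoff dates are approximations read off the Figure 1 caption. The caption mentions five eras but gives only four date ranges, so this sketch uses four.

```python
from datetime import date
from urllib.parse import urlencode

# Wayback Machine CDX search endpoint (lists archived captures of a URL).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url, start_year, end_year):
    """Build a CDX query URL that lists about one successful capture per month."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(start_year),
        "to": str(end_year),
        "filter": "statuscode:200",   # keep only captures that returned HTTP 200
        "collapse": "timestamp:6",    # collapse on YYYYMM, i.e. one row per month
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_urls(cdx_rows):
    """Convert CDX JSON rows (first row is the header) to (date, replay URL) pairs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    pairs = []
    for row in rows:
        ts = row[ts_i]  # 14-digit timestamp, YYYYMMDDhhmmss
        captured = date(int(ts[:4]), int(ts[4:6]), int(ts[6:8]))
        pairs.append((captured, f"https://web.archive.org/web/{ts}/{row[orig_i]}"))
    return pairs


# Hypothetical era table: (exclusive upper bound, parser name).
# Names and cutoffs are assumptions for illustration only.
ERAS = [
    (date(2008, 1, 1), "jurisdiction_pages"),
    (date(2011, 1, 1), "unified_dividers"),
    (date(2017, 1, 1), "modified_headers"),
    (date.max, "modern_unified"),
]


def era_for(captured):
    """Pick the parser name for a capture date using the first matching bound."""
    for upper_bound, parser_name in ERAS:
        if captured < upper_bound:
            return parser_name
    return ERAS[-1][1]
```

In use, the JSON returned by the query from `build_cdx_query` would be passed to `snapshot_urls`, and each capture date to `era_for` to select the parser for that page layout. Keeping the era table as data rather than branching logic makes the cutoffs easy to adjust when a capture turns out to use a different layout than its date suggests.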