Governance by Evidence: Regulated Predictors in Decision-Tree Models

Alexios Veskoukis, Dimitris Kalles

TL;DR

The paper investigates how predictors reported in the decision-tree literature align with law-defined regulated data categories (RDCs) across sectors and years. It constructs a corpus of decision-tree papers, maps reported predictors to a 13-class RDC ontology anchored to EU and US statutes, and validates these mappings through a multi-stage audit with a global correction factor. The results show that regulatory exposure is dominated by GDPR and HIPAA, with Health_Clinical as the top RDC and healthcare holding the largest industry share, while other regulations show narrower coverage. The work highlights implications for privacy-preserving ML, governance checks, and transparent reporting, and provides replication artifacts to advance accountability in applied ML practice.

Abstract

Decision-tree methods are widely used on structured tabular data and are valued for interpretability across many sectors. Published studies often list the predictors they use (for example age, diagnosis codes, location), and privacy laws increasingly regulate such data types. We therefore use published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example health data, biometric identifiers, children's data, financial attributes, location traces, and government IDs). We then link each category to specific excerpts in European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, and summarize regulation-aligned timing using each framework's reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.

Paper Structure

This paper contains 86 sections, 16 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: Identification and screening totals; (a) industry distribution over AI-relevant records ($n{=}8{,}386$); (b) per-industry counts in the final inclusion set after requiring the AI-assigned industry to match the query industry ($n{=}4{,}686$).
  • Figure 2: Predictor-validation counts conditioned on the industry-validated inclusion set. Top: Stage counts. Bottom: per-industry distribution among articles passing predictor validation.
  • Figure 3: Regulated data category assignment. Top: Stage count showing the number of unique predictors entering the assignment step. Bottom: distribution of assigned classes (classes with zero count in this step are omitted). Totals sum to $1{,}749$.
  • Figure 4: Validation outcomes. Top: Stage counts for the regulation-labeling gate, which forms pairs of possibly regulated predictors and the regulations that appear to govern each predictor's RDC. Bottom: confidence distribution within pairs labeled Regulated. For downstream reporting, only Regulated+High pairs ($n{=}2{,}329$) are carried forward.
  • Figure 5: Dataset construction pipeline. The main branch builds the decision-tree corpus and predictor table; the parallel branch extracts regulation fragments and maps them to regulated data categories (RDCs). Their RDC-based join yields predictor--regulation pairs for LLM validation and auditing.
  • ...and 14 more figures
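
The RDC-based join described in the Figure 5 caption can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the function name `join_on_rdc`, the predictor names, and the `Location` category label are hypothetical, and the toy rows are not taken from the paper's data. Only the `Health_Clinical` label and the GDPR and HIPAA names appear in the source.

```python
# Hypothetical toy tables; none of these rows come from the paper's data.
# Each predictor is assigned to a regulated data category (RDC), or to None
# when it falls outside every regulated category.
predictor_to_rdc = {
    "diagnosis_code": "Health_Clinical",
    "blood_pressure": "Health_Clinical",
    "gps_trace": "Location",   # assumed category name, for illustration only
    "page_views": None,        # assumed unregulated predictor
}

# Each RDC is linked to the regulations whose fragments mention it.
# The mapping is a placeholder, not the paper's ontology.
rdc_to_regulations = {
    "Health_Clinical": ["GDPR", "HIPAA"],
    "Location": ["GDPR"],
}


def join_on_rdc(predictors, regulations):
    """Return (predictor, rdc, regulation) triples that share an RDC."""
    pairs = []
    for predictor, rdc in predictors.items():
        # Predictors without an RDC, or with an unlinked RDC, yield no pairs.
        for regulation in regulations.get(rdc, []):
            pairs.append((predictor, rdc, regulation))
    return pairs


pairs = join_on_rdc(predictor_to_rdc, rdc_to_regulations)
for predictor, rdc, regulation in pairs:
    print(f"{predictor} [{rdc}] -> {regulation}")
```

In this sketch the resulting pairs are only candidates; according to the captions above, such pairs would still pass through a validation and confidence-labeling step before being reported.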