Governance by Evidence: Regulated Predictors in Decision-Tree Models
Alexios Veskoukis, Dimitris Kalles
TL;DR
The paper investigates how predictors reported in the decision-tree literature align with legally defined regulated data categories (RDCs) across sectors and years. It constructs a corpus of decision-tree papers, maps reported predictors to a 13-class RDC ontology anchored to EU and US statutes, and validates these mappings through a multi-stage audit with a global correction factor. The results reveal dominant regulatory exposure under GDPR and HIPAA, with Health_Clinical as the top RDC and healthcare holding the largest industry share, while other regulations show narrower coverage. The work highlights implications for privacy-preserving ML, governance checks, and transparent reporting, and provides replication artifacts to advance accountability in applied ML practice.
Abstract
Decision-tree methods are widely used on structured tabular data and are valued for interpretability across many sectors. Published studies often list the predictors they use (for example, age, diagnosis codes, location), and privacy laws increasingly regulate such data types. We therefore use published decision-tree papers as a proxy for real-world use of legally governed data. We compile a corpus of decision-tree studies and assign each reported predictor to a regulated data category (for example, health data, biometric identifiers, children's data, financial attributes, location traces, and government IDs). We then link each category to specific excerpts in European Union and United States privacy laws. We find that many reported predictors fall into regulated categories, with the largest shares in healthcare and clear differences across industries. We analyze prevalence, industry composition, and temporal patterns, and summarize regulation-aligned timing using each framework's reference year. Our evidence supports privacy-preserving methods and governance checks, and can inform ML practice beyond decision trees.
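To make the predictor-to-category step concrete, the following is a minimal sketch of how reported predictor names might be assigned to regulated data categories and summarized as shares. It is an illustration under stated assumptions, not the paper's pipeline: the keyword lookup, the function names `categorize` and `category_shares`, the "Unregulated" bucket, and every category label other than Health_Clinical are hypothetical.

```python
from collections import Counter

# Hypothetical keyword-to-category lookup. Only "Health_Clinical" is a label
# taken from the paper; the other labels and all keywords are illustrative.
RDC_KEYWORDS = {
    "Health_Clinical": ("diagnosis", "blood pressure", "icd"),
    "Biometric": ("fingerprint", "iris"),
    "Financial": ("income", "credit score"),
    "Location": ("gps", "zip code"),
    "Government_ID": ("passport", "social security"),
}


def categorize(predictor):
    """Return the first category whose keyword occurs in the predictor name, else None."""
    name = predictor.lower()
    for category, keywords in RDC_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return None


def category_shares(predictors):
    """Return the fraction of predictors per category; unmatched ones count as 'Unregulated'."""
    counts = Counter(categorize(p) or "Unregulated" for p in predictors)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    reported = ["Diagnosis code", "ICD-10 code", "GPS trace", "tree depth"]
    for category, share in category_shares(reported).items():
        print(f"{category}: {share:.2f}")
```

Simple substring matching of this kind would misclassify ambiguous predictor names, which is one reason a manual audit of the mappings matters. Linking each category to statute excerpts would be a separate lookup keyed by category.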
