DAGnosis: Localized Identification of Data Inconsistencies using Structures

Nicolas Huynh, Jeroen Berrevoets, Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar

TL;DR

DAGnosis tackles deployment-time inconsistencies in tabular data by encoding the training distribution's feature dependencies as a directed acyclic graph (DAG) and applying conformal prediction, guided by each feature's Markov boundary (MB), to produce feature-wise prediction intervals. By conditioning on informative feature subsets derived from the DAG, it localizes the cause of each inconsistency and improves detection accuracy, especially when dependencies are sparse. Empirically, it outperforms baselines that rely on compressive representations, remains robust to imperfect DAGs, and supports reliable downstream performance by allowing predictions on inconsistent samples to be deferred. The localization also guides data collection and auditing of real-world datasets.

Abstract

Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their use of compressive representations, and (2) lack of localization to pinpoint why a sample might be flagged as inconsistent, which is important to guide future data collection. We solve these two fundamental limitations using directed acyclic graphs (DAGs) to encode the probability distribution of the training set's features, and its independencies, as a structure. Our method, called DAGnosis, leverages these structural interactions to draw data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies and (2) provides more detailed insights into why some samples are flagged.

Paper Structure (43 sections, 1 theorem, 3 equations, 7 figures, 7 tables, 1 algorithm)

Key Result

Proposition A.1

If the data points $X^{(1)}, X^{(2)}, \ldots, X^{(n+1)}$ are exchangeable (with $\mathcal{D}_{\mathrm{cal}}$ defined as the set comprising the first $n$ samples), we have for all $i \in [d]$ that $\mathbb{P}( X^{(n+1)}_i \in [l_{i, \alpha}(X^{(n+1)}), r_{i, \alpha}(X^{(n+1)})]) \geq 1-\alpha$ for $0 < \alpha < \ldots$
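The marginal coverage statement can be checked numerically on toy data. The sketch below is not the paper's implementation: it assumes a split-conformal interval built from absolute residuals of a fixed predictor, and the helper names `conformal_quantile` and `empirical_coverage`, the synthetic data, and the choice $\alpha = 0.1$ are all illustrative assumptions.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic to use
    if k > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(scores)[k - 1]

def empirical_coverage(n_cal=200, n_trials=2000, alpha=0.1, seed=0):
    """Fraction of trials in which the (n_cal + 1)-th point lands in its interval."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumed toy data: i.i.d. (hence exchangeable) pairs with y = 2x + noise.
        x = rng.normal(size=n_cal + 1)
        y = 2.0 * x + rng.normal(size=n_cal + 1)
        pred = 2.0 * x  # fixed predictor, standing in for a trained model
        # Calibration scores come from the first n_cal points only.
        q = conformal_quantile(np.abs(y[:n_cal] - pred[:n_cal]), alpha)
        # The interval for the held-out point is [pred - q, pred + q].
        hits += abs(y[n_cal] - pred[n_cal]) <= q
    return hits / n_trials

print(empirical_coverage())
```

With these settings the printed frequency should sit close to $1 - \alpha = 0.9$, up to Monte Carlo noise. The guarantee is marginal: it averages over the calibration set and the test point together, not conditionally on a particular test sample.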

Figures (7)

  • Figure 1: DAGnosis Provides Precise Analysis. DAGnosis takes a radically different approach compared to other data-evaluation methods. Rather than evaluating each dimension of a new sample in relation to all the other dimensions, we evaluate in relation to the structure of the sample. This may lead to different samples being flagged while giving interpretation for that conclusion.
  • Figure 2: High-level Overview of DAGnosis. DAGnosis evaluates samples in a test-bed dataset $\mathcal{D}_\text{test}$. It first learns a DAG (using a variety of structure learners). Next, DAGnosis builds prediction intervals for every feature using conformal prediction. They are conditioned on smart subsets of the data's features, informed by the DAG.
  • Figure 3: (a): Deferring prediction on $\mathcal{D}_{\mathrm{flagged}}^{(k)}$, the set of samples flagged by DAGnosis, leads to a better downstream accuracy. (b): We report the proportion of test samples which are flagged and are women or men, for both DAGnosis and Data-SUITE (DS). DAGnosis is more accurate than DS because it flags more inconsistent samples, while flagging a similar number of men.
  • Figure 4: We depict the Markov boundary for the feature Country, which is flagged for the given example. An investigation of $\mathcal{D}_{\mathrm{train}}$ shows that this inconsistency can be traced back to the Occupation feature.
  • Figure 5: Adult DAG. DAG discovered with the PC algorithm, using $\mathcal{D}_{\mathrm{train}}$.
  • ...and 2 more figures
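The pipeline described in the Figure 2 caption (per-feature intervals conditioned on DAG-informed subsets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the chain A -> B -> C, the linear regressors, the absolute-residual conformal score, and the names `fit_linear`, `build_intervals`, and `flag_features` are all hypothetical, and the conditioning sets are written out by hand rather than learned by a structure learner.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef

def build_intervals(train, cal, cond_sets, alpha=0.1):
    """For each feature i, fit a predictor on cond_sets[i] and calibrate a half-width."""
    n = len(cal)
    # Conformal rank, clipped to n (assumes the calibration set is large enough).
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    models = {}
    for i, cond in cond_sets.items():
        predict = fit_linear(train[:, cond], train[:, i])
        scores = np.sort(np.abs(cal[:, i] - predict(cal[:, cond])))
        models[i] = (predict, scores[k - 1])
    return models

def flag_features(x, models, cond_sets):
    """Return the indices of features whose value falls outside its interval."""
    flagged = []
    for i, (predict, half_width) in models.items():
        center = predict(x[cond_sets[i]][None, :])[0]
        if abs(x[i] - center) > half_width:
            flagged.append(i)
    return flagged

# Assumed toy chain A -> B -> C; columns are ordered (A, B, C).
rng = np.random.default_rng(0)

def sample(n):
    a = rng.normal(size=n)
    b = 2.0 * a + 0.1 * rng.normal(size=n)
    c = -b + 0.1 * rng.normal(size=n)
    return np.column_stack([a, b, c])

# Conditioning sets written by hand: the Markov boundary of each node in the chain.
cond_sets = {0: [1], 1: [0, 2], 2: [1]}
models = build_intervals(sample(500), sample(500), cond_sets)

x_test = np.array([1.0, 2.0, 5.0])  # A and B agree; C is corrupted (about -2 expected)
flagged = flag_features(x_test, models, cond_sets)
print(flagged)  # indices of the flagged features
```

In this toy setup the corrupted feature C is flagged together with B, whose conditioning set contains C, while A is left unflagged. The flags therefore stay within the neighbourhood of the corrupted node instead of spreading over every feature.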

Theorems & Definitions (4)

  • Definition 3.1: Markov blanket
  • Definition 3.2: Markov boundary
  • Definition 3.3: Inconsistency
  • Proposition A.1: Marginal coverage
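For intuition on Definitions 3.1 and 3.2: when the distribution is faithful to a DAG, the Markov boundary of a node is the set of its parents, its children, and the other parents of its children (its spouses). The sketch below reads that set off an edge list. The function name `markov_boundary` and the five-node graph are hypothetical and are not taken from the paper.

```python
def markov_boundary(edges, node):
    """Parents, children, and spouses of `node` in a DAG given as (parent, child) pairs."""
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    # Spouses: other parents of the node's children.
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses

# Hypothetical DAG: A -> C <- B, C -> D <- E
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "D")]

print(sorted(markov_boundary(edges, "C")))  # ['A', 'B', 'D', 'E']
print(sorted(markov_boundary(edges, "A")))  # ['B', 'C']
```

Node A has no parents, yet its boundary includes the spouse B, because conditioning on the shared child C makes A and B dependent.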