DAGnosis: Localized Identification of Data Inconsistencies using Structures

Nicolas Huynh; Jeroen Berrevoets; Nabeel Seedat; Jonathan Crabbé; Zhaozhi Qian; Mihaela van der Schaar

TL;DR

DAGnosis tackles deployment-time data inconsistencies in tabular data by encoding the feature dependencies of the training distribution as a directed acyclic graph (DAG) and applying conformal prediction, guided by each feature's Markov boundary (MB), to produce feature-wise prediction intervals. By conditioning on informative feature subsets derived from the DAG, it localizes the causes of inconsistencies and improves detection accuracy, especially when dependencies between features are sparse. Empirically, it outperforms baselines that rely on compressive representations, remains effective with imperfect DAGs, and supports reliable downstream performance by allowing predictions on inconsistent samples to be deferred. The localization it provides can also guide data collection and auditing.

Abstract

Identifying and appropriately handling inconsistencies in data at deployment time is crucial for using machine learning models reliably. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their use of compressive representations, and (2) a lack of localization to pinpoint why a sample might be flagged as inconsistent, which is important to guide future data collection. We address these two limitations by using directed acyclic graphs (DAGs) to encode the probability distribution of the training set's features, and its independencies, as a structure. Our method, called DAGnosis, leverages these structural interactions to draw data-centric conclusions. DAGnosis unlocks the localization of the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate conclusions in detecting inconsistencies, and (2) provides more detailed insights into why some samples are flagged.

Paper Structure (43 sections, 1 theorem, 3 equations, 7 figures, 7 tables, 1 algorithm)

INTRODUCTION
RELATED WORK
DAGNOSIS: IDENTIFYING INCONSISTENCIES USING STRUCTURES
Structure-based Assessment of Samples
Leveraging Structure to Flag Inconsistencies
EXPERIMENTS
DAGnosis Flags Inconsistencies Accurately
DAGnosis is Effective Even With Imperfect DAGs
DAGnosis Unlocks Localization
Methodology.
Results.
Reliable Downstream Performance
HOW TO USE DAGNOSIS STEP-BY-STEP
DISCUSSION
Appendix: DAGnosis: Localized Identification of Data Inconsistencies using Structures
...and 28 more sections

Key Result

Proposition A.1

If the data points $X^{(1)}, X^{(2)}, \dots, X^{(n+1)}$ are exchangeable (with $\mathcal{D}_{\mathrm{cal}}$ defined as the set comprising the first $n$ samples), we have for all $i \in [d]$ that $\mathbb{P}\left( X^{(n+1)}_i \in [l_{i, \alpha}(X^{(n+1)}), r_{i, \alpha}(X^{(n+1)})] \right) \geq 1-\alpha$ for $0 < \alpha <$ ...
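To make the marginal coverage statement concrete, here is a minimal sketch of split conformal prediction for a single feature. It assumes an absolute-residual nonconformity score and a hypothetical, already-fitted predictor (`predict_feature`); neither is claimed to be the choice made in the paper, and the simulated data is purely illustrative.

```python
import numpy as np


def split_conformal_interval(pred_cal, y_cal, pred_new, alpha=0.1):
    """Symmetric intervals around pred_new from calibration residuals.

    Assumption: the score is |y - prediction|. Under exchangeability,
    the interval covers a new point with probability >= 1 - alpha.
    """
    pred_cal = np.asarray(pred_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    pred_new = np.asarray(pred_new, dtype=float)
    n = len(y_cal)
    scores = np.sort(np.abs(y_cal - pred_cal))
    # Conformal rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few calibration points the interval must be unbounded.
    q = scores[k - 1] if k <= n else np.inf
    return pred_new - q, pred_new + q


def predict_feature(x):
    """Hypothetical pre-fitted predictor of one feature from another."""
    return 2.0 * x


# Illustrative simulated data (not from the paper).
rng = np.random.default_rng(0)
x_cal = rng.normal(size=2000)
y_cal = 2.0 * x_cal + rng.normal(size=2000)
x_new = rng.normal(size=5000)
y_new = 2.0 * x_new + rng.normal(size=5000)

lower, upper = split_conformal_interval(
    predict_feature(x_cal), y_cal, predict_feature(x_new), alpha=0.1
)
coverage = np.mean((lower <= y_new) & (y_new <= upper))
print(f"empirical coverage: {coverage:.3f}")
```

The empirical coverage on fresh exchangeable data should land close to $1-\alpha$; the guarantee is marginal (averaged over samples), not conditional on a particular input.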

Figures (7)

Figure 1: DAGnosis Provides Precise Analysis. DAGnosis takes a different approach from other data-evaluation methods. Rather than evaluating each dimension of a new sample in relation to all the other dimensions, we evaluate it in relation to the structure of the sample. This may lead to different samples being flagged, while providing an interpretation for that conclusion.
Figure 2: High-level Overview of DAGnosis. DAGnosis evaluates samples in a test-bed dataset $\mathcal{D}_\text{test}$. It first learns a DAG (using a variety of structure learners). Next, DAGnosis builds prediction intervals for every feature using conformal prediction. These intervals are conditioned on subsets of the data's features that are selected using the DAG.
Figure 3: (a): Deferring prediction on $\mathcal{D}_{\mathrm{flagged}}^{(k)}$, the set of samples flagged by DAGnosis, leads to better downstream accuracy. (b): We report the proportion of test samples that are flagged, split into women and men, for both DAGnosis and Data-SUITE (DS). DAGnosis is more accurate than DS because it flags more inconsistent samples, while flagging a similar number of men.
Figure 4: We depict the Markov boundary for the feature Country, which is flagged for the given example. An investigation of $\mathcal{D}_{\mathrm{train}}$ shows that this inconsistency can be traced back to the Occupation feature.
Figure 5: Adult DAG. DAG discovered with the PC algorithm, using $\mathcal{D}_{\mathrm{train}}$.
...and 2 more figures
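The flagging and deferral steps described in the captions of Figures 2 and 3(a) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-sample interval arrays `lower` and `upper` are taken as given (for example, produced by conformal prediction), and the feature names and numbers are hypothetical.

```python
import numpy as np


def flag_inconsistencies(X, lower, upper):
    """Return (outside, flagged).

    outside[i, j] is True when feature j of sample i falls outside its
    prediction interval; flagged[i] is True when any feature does.
    """
    X = np.asarray(X, dtype=float)
    outside = (X < np.asarray(lower)) | (X > np.asarray(upper))
    return outside, outside.any(axis=1)


# Hypothetical feature names and toy values, for illustration only.
feature_names = ["f1", "f2", "f3"]
X_test = np.array([
    [0.1, 1.0, 5.0],
    [0.2, 9.0, 5.1],
    [0.0, 1.2, 4.9],
])
lower = np.array([[-1.0, 0.0, 4.0]] * 3)
upper = np.array([[1.0, 2.0, 6.0]] * 3)

outside, flagged = flag_inconsistencies(X_test, lower, upper)

# Localization: which features triggered the flag for each flagged sample.
localized = {
    int(i): [feature_names[j] for j in np.flatnonzero(outside[i])]
    for i in np.flatnonzero(flagged)
}

# Deferral: only unflagged samples are passed to the downstream model.
X_kept = X_test[~flagged]
X_deferred = X_test[flagged]
print(localized, X_kept.shape, X_deferred.shape)
```

Keeping the per-feature mask `outside`, rather than only the per-sample flag, is what allows a flagged sample to be traced back to specific features.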

Theorems & Definitions (4)

Definition 3.1: Markov blanket
Definition 3.2: Markov boundary
Definition 3.3: Inconsistency
Proposition A.1: Marginal coverage
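As an illustration of the Markov boundary notion listed above, the sketch below reads it off a DAG using the standard graphical characterization: a node's parents, its children, and its children's other parents. This characterization holds when the distribution is faithful to the DAG, which is assumed here. The graph and its node names are hypothetical and are not taken from the paper.

```python
def markov_boundary(parents, node):
    """Markov boundary of `node` in a DAG given as {node: [parents]}.

    Assumes faithfulness, so the boundary is the set of parents,
    children, and spouses (other parents of the node's children).
    """
    node_parents = set(parents.get(node, ()))
    children = {v for v, ps in parents.items() if node in ps}
    spouses = {p for c in children for p in parents[c]} - {node}
    return node_parents | children | spouses


# Hypothetical toy DAG: A -> B -> C <- D, and C -> E.
toy_dag = {
    "A": [],
    "B": ["A"],
    "C": ["B", "D"],
    "D": [],
    "E": ["C"],
}

for name in toy_dag:
    print(name, sorted(markov_boundary(toy_dag, name)))
```

In this toy graph the boundary of `B` includes `D`, even though no edge connects them: `D` is a co-parent of `C`, so it becomes informative about `B` once `C` is known.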
