Tab-Shapley: Identifying Top-k Tabular Data Quality Insights

Manisha Padala, Lokesh Nagalapatti, Atharv Tyagi, Ramasuri Narayanam, Shiv Kumar Saini

TL;DR

This work tackles the unsupervised identification of anomaly sources in tabular data by introducing Tab-Shapley, a cooperative-game framework that uses Shapley values to rank attributes by their contribution to detected anomalies. The method derives cell-level anomaly labels with a TabNet-based autoencoder, constructs evidence sets for attributes and records, and exploits a closed-form Shapley value to efficiently prioritize the top-$k$ data quality insights, each a block-like anomalous region. A block-extraction procedure built on a scoring matrix and Kadane's algorithm identifies the most informative data-quality blocks scalably. Empirically, Tab-Shapley performs competitively with supervised SHAP and concentrates anomalies better than DIFFI. For data engineers, the approach offers human-readable, prioritized sources of anomalies while remaining computationally efficient, thanks to the closed-form Shapley computation and the unsupervised labeling pipeline.
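
The block-extraction idea mentioned above can be illustrated with a minimal sketch of a maximum-sum contiguous block search, which applies Kadane's algorithm to row sums accumulated over every pair of column boundaries. The scoring rule used here (+1 for a flagged cell, a penalty of `ALPHA` for a normal cell), the toy `labels` grid, and the function names are assumptions made for illustration; the paper's actual scoring matrix may be defined differently.

```python
from typing import List, Tuple


def kadane(values: List[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) of the maximum-sum contiguous run (inclusive)."""
    best_sum, best_start, best_end = values[0], 0, 0
    run_sum, run_start = values[0], 0
    for i in range(1, len(values)):
        # A negative running sum can only hurt, so restart the run at i.
        if run_sum < 0:
            run_sum, run_start = values[i], i
        else:
            run_sum += values[i]
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i
    return best_sum, best_start, best_end


def best_block(score: List[List[float]]) -> Tuple[float, int, int, int, int]:
    """Return (sum, top, bottom, left, right) of the maximum-sum contiguous block."""
    n_rows, n_cols = len(score), len(score[0])
    best = (float("-inf"), 0, 0, 0, 0)
    for left in range(n_cols):
        row_sums = [0.0] * n_rows
        for right in range(left, n_cols):
            # Extend the column window by one and update per-row totals.
            for r in range(n_rows):
                row_sums[r] += score[r][right]
            total, top, bottom = kadane(row_sums)
            if total > best[0]:
                best = (total, top, bottom, left, right)
    return best


# Hypothetical scoring rule (an assumption, not taken from the paper):
# flagged cells score +1, normal cells score -ALPHA.
ALPHA = 0.2
labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
score = [[1.0 if cell else -ALPHA for cell in row] for row in labels]
print(best_block(score))
```

With `ALPHA = 0.2`, the best block in this toy grid spans rows 1 to 3 and columns 1 to 2, absorbing one normal cell. Raising the hypothetical penalty makes normal cells costlier to include, so the selected block tends to shrink. The search runs in time proportional to rows times columns squared, and it assumes rows and columns are already ordered so that related cells are adjacent.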

Abstract

We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
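
To make the exponential-versus-closed-form contrast concrete, the following sketch computes exact Shapley values by enumerating every ordering of the players, then compares them against a closed form on a toy game. The toy game is an assumption chosen for illustration, not the paper's $\mathscr{V}_a$: a coalition of attributes is worth the number of records whose flagged attributes all lie inside it. For that specific game, a standard result says each record splits one unit of credit equally among its flagged attributes. The record names, attribute names, and helper functions are all hypothetical.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, List

# Hypothetical cell-level labels: record -> attributes flagged as anomalous.
anomalous_attrs = {
    "r0": {"age"},
    "r1": {"age", "zip"},
    "r2": {"age", "zip"},
    "r3": {"age", "zip", "income"},
}
attributes = ["age", "zip", "income"]


def toy_value(coalition: FrozenSet[str]) -> int:
    """Toy characteristic function: records fully explained by the coalition."""
    return sum(1 for attrs in anomalous_attrs.values() if attrs <= coalition)


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values by averaging marginal gains over all n! orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def closed_form_shapley(players: List[str]) -> Dict[str, float]:
    """Closed form for the toy game only: each record shares 1 unit equally."""
    phi = {p: 0.0 for p in players}
    for attrs in anomalous_attrs.values():
        for p in attrs:
            phi[p] += 1.0 / len(attrs)
    return phi


print(shapley_values(attributes, toy_value))
print(closed_form_shapley(attributes))
```

The brute-force routine visits $n!$ orderings and becomes impractical beyond a handful of attributes, whereas the closed form makes a single pass over the records. Both agree on this toy game, and the values sum to the worth of the full attribute set.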

Paper Structure

This paper contains 12 sections, 2 theorems, 8 equations, 5 figures, 3 tables, and 2 algorithms.

Key Result

Proposition 1

The above-defined cooperative game $(A,\mathscr{V}_a)$ is super-additive.
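
Super-additivity means that, for any two disjoint coalitions $S$ and $T$, the value of $S \cup T$ is at least the sum of the values of $S$ and $T$. The sketch below checks this property by brute force on two toy games. Both games, the record data, and the function names are assumptions for illustration and are not the paper's $\mathscr{V}_a$, whose definition is not reproduced here.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Iterator, List

# Hypothetical cell-level labels: record -> attributes flagged as anomalous.
anomalous_attrs = {
    "r0": {"age"},
    "r1": {"age", "zip"},
    "r2": {"age", "zip"},
    "r3": {"age", "zip", "income"},
}
attributes = ["age", "zip", "income"]


def explained_records(coalition: FrozenSet[str]) -> int:
    """Toy game 1: records whose flagged attributes all lie in the coalition."""
    return sum(1 for attrs in anomalous_attrs.values() if attrs <= coalition)


def touched_records(coalition: FrozenSet[str]) -> int:
    """Toy game 2: records with at least one flagged attribute in the coalition."""
    return sum(1 for attrs in anomalous_attrs.values() if attrs & coalition)


def subsets(items: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Yield every subset of items, including the empty set."""
    pool = list(items)
    for size in range(len(pool) + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)


def is_super_additive(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> bool:
    """Check v(S | T) >= v(S) + v(T) for every pair of disjoint coalitions."""
    all_subsets = list(subsets(players))
    for s in all_subsets:
        for t in all_subsets:
            if s & t:
                continue  # only disjoint pairs are constrained
            if value(s | t) < value(s) + value(t):
                return False
    return True


print(is_super_additive(attributes, explained_records))
print(is_super_additive(attributes, touched_records))
```

The first toy game passes the check, while the second fails because records touched by both `age` and `zip` are counted twice when the two attributes are valued separately. The check enumerates all pairs of subsets, so it is only practical for a small number of attributes.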

Figures (5)

  • Figure 1: Top-$k$ insights: Darker cells indicate anomalies. The blocks filled with blue patterns show the top-$k$ insights for $k=3$. The results are shown for $\alpha=0.2$; higher values of $\alpha$ would create smaller blocks.
  • Figure 2: Impact of $\alpha$ on $N\!A$ cells contained in the top-$1$ insight.
  • Figure 3: Number of ground-truth anomalies captured in the $2 \times 2$ sub-matrix using the Tab-Shapley and DIFFI methods.
  • Figure 4: Number of ground-truth anomalies captured in the $4 \times 4$ sub-matrix using the Tab-Shapley and DIFFI methods.
  • Figure 5: Number of ground-truth anomalies captured in the $6 \times 6$ sub-matrix using the Tab-Shapley and DIFFI methods.

Theorems & Definitions (6)

  • Definition 1: Evidence Sets for Attributes
  • Definition 2: Evidence Sets for Records
  • Proposition 1
  • Lemma 1
  • Proof
  • Example 1