Data Debugging is NP-hard for Classifiers Trained with SGD

Zizheng Guo; Pengyu Chen; Yanzhang Fu; Dongjing Miao

TL;DR

Data debugging seeks a training-data subset whose retraining yields correct test predictions; this work analyzes the computational complexity of Debuggable for SGD-trained linear classifiers. It establishes $\mathrm{NP}$-hardness in the general, unfixed-loss setting, using reductions from problems like Monotone $1$-in-$3$ SAT and Subset Sum, and shows robustness across training orders. When the loss is fixed, the authors obtain a dichotomy: a linear loss case is solvable in linear time, while hinge-like losses are hard in higher dimensions (with $d \ge 2$ and adversarial order) but become tractable in 1D for certain intercept conditions ($\beta \ge 0$). These results highlight fundamental limits of score-based data cleaning approaches and motivate future work with CSP solvers or randomized algorithms for scalable data debugging.

Abstract

Data debugging seeks a subset of the training data such that the model obtained by retraining on that subset achieves better accuracy. Many heuristic approaches have been proposed, but none is guaranteed to solve the problem effectively. This leaves open whether an efficient algorithm exists for finding such a subset. To answer this question and to provide a theoretical basis for developing better data debugging algorithms, we investigate the computational complexity of a problem named Debuggable. Given a machine learning model $\mathcal{M}$ obtained by training on a dataset $D$ and a test instance $(\mathbf{x}_\text{test},y_\text{test})$ where $\mathcal{M}(\mathbf{x}_\text{test})\neq y_\text{test}$, Debuggable asks whether there exists a subset $D^\prime$ of $D$ such that the model $\mathcal{M}^\prime$ obtained by retraining on $D^\prime$ satisfies $\mathcal{M}^\prime(\mathbf{x}_\text{test})=y_\text{test}$. To cover a wide range of commonly used models, we take the SGD-trained linear classifier as the model and derive the following main results. (1) If the loss function and the dimension of the model are not fixed, Debuggable is NP-complete regardless of the order in which the training samples are processed during SGD. (2) For hinge-like loss functions, we provide a comprehensive analysis of the computational complexity of Debuggable. (3) If the loss function is linear, Debuggable can be solved in linear time; that is, data debugging is easy in this case. These results not only highlight the limitations of current approaches but also offer new insights into data debugging.
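To make the definition of Debuggable concrete, the following is a minimal Python sketch, not the paper's construction. It assumes a single SGD pass from a zero initial weight vector, a bias-free linear classifier that predicts $+1$ only when the score is strictly positive, and two illustrative losses: a hinge loss $\max(0, 1 - y\,\mathbf{w}\cdot\mathbf{x})$ and a linear loss $-y\,\mathbf{w}\cdot\mathbf{x}$. All function names (`sgd_train`, `debuggable_bruteforce`, `debuggable_linear_loss`) are hypothetical. The brute-force check enumerates every subset and is exponential in $|D|$. The linear-loss check illustrates one way a linear-time answer can arise under these assumptions: every retained sample adds a fixed, weight-independent update, so each sample can be kept or dropped independently.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgd_train(samples, dim, loss="hinge", lr=1.0):
    """One SGD pass over `samples` in the given order, starting from w = 0 (assumption)."""
    w = [0.0] * dim
    for x, y in samples:
        margin = y * dot(w, x)
        # Hinge loss updates only when the margin is below 1;
        # linear loss -y*(w.x) has a constant gradient and always updates.
        if loss == "linear" or margin < 1.0:
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Assumed decision rule: +1 if the score is strictly positive, else -1."""
    return 1 if dot(w, x) > 0 else -1


def debuggable_bruteforce(data, x_test, y_test, loss="hinge", lr=1.0):
    """Try every subset D' of `data` (training order preserved); exponential time."""
    for k in range(len(data) + 1):
        for subset in combinations(data, k):
            w = sgd_train(subset, len(x_test), loss, lr)
            if predict(w, x_test) == y_test:
                return True, list(subset)
    return False, None


def debuggable_linear_loss(data, x_test, y_test, lr=1.0):
    """Linear-time check under the assumed linear loss.

    The final score is the sum of per-sample contributions lr * y_i * (x_i . x_test),
    so keeping exactly the samples that push the score toward y_test is optimal.
    """
    score = 0.0
    for x, y in data:
        contrib = lr * y * dot(x, x_test)
        if y_test * contrib > 0:
            score += contrib
    return (1 if score > 0 else -1) == y_test


# Toy instance: training on all three samples misclassifies x_test.
data = [((1.0, 0.0), -1), ((0.0, 1.0), 1), ((1.0, 1.0), -1)]
x_test, y_test = (1.0, 1.0), 1

print(predict(sgd_train(data, 2), x_test))          # prediction of the full model
print(debuggable_bruteforce(data, x_test, y_test))  # witness subset, if any
print(debuggable_linear_loss(data, x_test, y_test))
```

On this toy instance the model trained on all of `data` predicts $-1$, while retraining on the second sample alone yields the correct label, so the instance is debuggable. The brute-force routine is only usable for very small datasets, which is consistent with the hardness results stated above.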

Paper Structure (12 sections, 13 theorems, 82 equations, 3 tables, 1 algorithm)

Introduction
Related Works
Preliminaries and Problem Definition
Results for Unfixed Loss Functions
Results for Fixed Loss Functions
The Easy Case
The Hard Case
Discussion and Conclusion
Detailed Proofs for Section \ref{gd-hardness}
Detailed Proofs for Section \ref{fixed-loss}
Proof of Theorem \ref{1d-hinge-hard}
Proof of Theorem \ref{thm:hinge-hard} for .

Key Result

Theorem 3.1

Debuggable-Lin is NP-hard for all training orders.

Theorems & Definitions (27)

Theorem 3.1
proof : Proof Sketch
Theorem 4.1
Theorem 4.2
proof
Theorem 4.3
proof : Proof sketch.
Theorem 4.4
Lemma A.1
proof
...and 17 more
