
Data Debugging is NP-hard for Classifiers Trained with SGD

Zizheng Guo, Pengyu Chen, Yanzhang Fu, Dongjing Miao

TL;DR

Data debugging seeks a subset of the training data such that retraining on it yields a correct prediction on a given test instance; this work analyzes the computational complexity of the corresponding decision problem, Debuggable, for SGD-trained linear classifiers. It establishes $\mathrm{NP}$-hardness in the general setting where the loss function is not fixed, using reductions from problems such as Monotone $1$-in-$3$ SAT and Subset Sum, and shows that the hardness holds regardless of the training order. When the loss is fixed, the authors obtain a dichotomy: the linear-loss case is solvable in linear time, while hinge-like losses are hard in higher dimensions (with $d \ge 2$ and adversarial order) but become tractable in 1D under certain intercept conditions ($\beta \ge 0$). These results highlight fundamental limits of score-based data cleaning approaches and motivate future work on CSP solvers or randomized algorithms for scalable data debugging.

Abstract

Data debugging aims to find a subset of the training data such that the model obtained by retraining on that subset achieves better accuracy. Many heuristic approaches have been proposed, but none of them is guaranteed to solve the problem effectively. This leaves open the question of whether there exists an efficient algorithm for finding such a subset. To answer this question and to provide a theoretical basis for developing better data debugging algorithms, we investigate the computational complexity of a problem named Debuggable. Given a machine learning model $\mathcal{M}$ obtained by training on a dataset $D$ and a test instance $(\mathbf{x}_\text{test},y_\text{test})$ where $\mathcal{M}(\mathbf{x}_\text{test})\neq y_\text{test}$, Debuggable asks whether there exists a subset $D^\prime$ of $D$ such that the model $\mathcal{M}^\prime$ obtained by retraining on $D^\prime$ satisfies $\mathcal{M}^\prime(\mathbf{x}_\text{test})=y_\text{test}$. To cover a wide range of commonly used models, we take the SGD-trained linear classifier as the model and derive the following main results. (1) If the loss function and the dimension of the model are not fixed, Debuggable is NP-complete regardless of the order in which the training samples are processed during SGD. (2) For hinge-like loss functions, we provide a comprehensive analysis of the computational complexity of Debuggable. (3) If the loss function is linear, Debuggable can be solved in linear time; that is, data debugging is easy in this case. These results not only highlight the limitations of current approaches but also offer new insights into data debugging.
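
To make the definition of Debuggable concrete, the following is a minimal sketch under assumptions that are not taken from the paper: a single SGD pass in the given order, zero initialization, a fixed learning rate `lr`, no intercept term, prediction by the sign of $\mathbf{w}^\top\mathbf{x}$ with ties mapped to $+1$, and the specific losses `hinge_grad` and `linear_grad`. All function names are hypothetical. `debuggable_bruteforce` enumerates every subset and is therefore exponential in $|D|$. `debuggable_linear_loss` illustrates one plausible reason the linear-loss case is easy: the gradient does not depend on $\mathbf{w}$, so each retained sample shifts the test score by a fixed amount and samples can be selected independently.

```python
from itertools import combinations
from typing import Callable, List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, label in {-1, +1})


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def hinge_grad(m: float) -> float:
    """Derivative of the hinge loss max(0, 1 - m) w.r.t. the margin m (assumed loss)."""
    return -1.0 if m < 1.0 else 0.0


def linear_grad(m: float) -> float:
    """Derivative of the linear loss -m w.r.t. the margin m (assumed loss)."""
    return -1.0


def sgd_train(samples: Sequence[Sample], dloss: Callable[[float], float],
              dim: int, lr: float = 1.0) -> List[float]:
    """One SGD pass over `samples` in the given order, starting from w = 0."""
    w = [0.0] * dim
    for x, y in samples:
        g = dloss(y * dot(w, x))  # dL/dm at the current margin m = y * <w, x>
        # Chain rule: dL/dw = g * y * x
        w = [wi - lr * g * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    # Assumption: ties (score exactly 0) are classified as +1.
    return 1 if dot(w, x) >= 0 else -1


def debuggable_bruteforce(data: Sequence[Sample], x_test: Sequence[float],
                          y_test: int, dloss: Callable[[float], float],
                          lr: float = 1.0) -> Optional[Tuple[int, ...]]:
    """Return indices of a smallest subset whose retrained model predicts
    y_test on x_test, or None if no subset works. Tries all 2^n subsets."""
    n, dim = len(data), len(x_test)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            subset = [data[i] for i in idx]  # keeps the original training order
            if predict(sgd_train(subset, dloss, dim, lr), x_test) == y_test:
                return idx
    return None


def debuggable_linear_loss(data: Sequence[Sample], x_test: Sequence[float],
                           y_test: int, lr: float = 1.0) -> Optional[Tuple[int, ...]]:
    """Linear-time check for the assumed loss -m. Each kept sample adds
    lr * y * x to w, so it shifts the test score by lr * y * <x, x_test>
    independently of the other samples and of the training order."""
    keep = tuple(i for i, (x, y) in enumerate(data)
                 if y_test * y * dot(x, x_test) > 0)
    w = [0.0] * len(x_test)
    for i in keep:
        x, y = data[i]
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return keep if predict(w, x_test) == y_test else None
```

On small instances the two procedures can be cross-checked against each other by passing `linear_grad` to `debuggable_bruteforce`; the brute-force search is only practical for a handful of samples, which is consistent with, though of course not a proof of, the hardness results stated above.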

Paper Structure

This paper contains 12 sections, 13 theorems, 82 equations, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Debuggable-Lin is NP-hard for all training orders.

Theorems & Definitions (27)

  • Theorem 3.1
  • Proof sketch
  • Theorem 4.1
  • Theorem 4.2
  • Proof
  • Theorem 4.3
  • Proof sketch
  • Theorem 4.4
  • Lemma A.1
  • Proof
  • ...and 17 more