Detecting Errors in a Numerical Response via any Regression Model

Hang Zhou; Jonas Mueller; Mayank Kumar; Jane-Ling Wang; Jing Lei

TL;DR

This work proposes a simple yet efficient filtering procedure for removing likely errors in a numerical response, establishes theoretical guarantees for the procedure, and reports better precision/recall at identifying incorrect values than other approaches.

Abstract

Noise plagues many numerical datasets: the recorded values may fail to match the true underlying values for reasons including erroneous sensors, data entry/processing mistakes, or imperfect human estimates. We consider general regression settings with covariates and a potentially corrupted response whose observed values may contain errors. By accounting for various uncertainties, we introduce veracity scores that distinguish between genuine errors and natural data fluctuations, conditioned on the available covariate information in the dataset. We propose a simple yet efficient filtering procedure for eliminating potential errors, and establish theoretical guarantees for our method. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
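
To make the setting concrete, the following is a minimal sketch of residual-based error scoring with an arbitrary regression model. It is not the paper's veracity score: the function name `out_of_fold_scores`, the cubic polynomial regressor, and the single global MAD scale are all illustrative assumptions, and the paper's scores account for covariate-dependent uncertainties that this sketch ignores. The synthetic setup (10% of responses shifted by 2) mirrors the description in the Figure 1 caption.

```python
import numpy as np

def out_of_fold_scores(x, y, n_folds=5, degree=3, seed=0):
    """Hypothetical score: |out-of-fold residual| / robust scale (larger = more suspicious)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds  # random, roughly balanced fold labels
    resid = np.empty(n)
    for k in range(n_folds):
        held_out = folds == k
        # Any regressor could be used here; a cubic polynomial is an assumption.
        coefs = np.polyfit(x[~held_out], y[~held_out], degree)
        resid[held_out] = y[held_out] - np.polyval(coefs, x[held_out])
    # One global robust scale (MAD); a stand-in for a covariate-dependent estimate.
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    return np.abs(resid) / scale

# Synthetic data: 10% of responses corrupted by a mean shift of 2.
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-2, 2, n)
y_clean = np.sin(x) + 0.2 * rng.normal(size=n)
is_error = np.zeros(n, dtype=bool)
is_error[rng.choice(n, size=50, replace=False)] = True
y_obs = y_clean + 2.0 * is_error

scores = out_of_fold_scores(x, y_obs)
flagged = np.argsort(scores)[-50:]     # the 50 most suspicious points
precision = is_error[flagged].mean()   # fraction of flagged points that are true errors
print(f"precision among top-50 flagged points: {precision:.2f}")
```

Scoring each point with a model that never saw it during fitting keeps a corrupted value from pulling the prediction toward itself, which is one common reason to prefer out-of-fold residuals in this kind of sketch.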

Paper Structure (19 sections, 4 theorems, 15 equations, 2 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Methods
Veracity scores
Filtering procedure
Theoretical Analysis
Simulation Study
Conformal inference using the proposed scores
Filtering procedure
Benchmark with Real Data and Real Errors
Discussion
Proofs of theorems in the Theoretical Analysis section
Benchmark Details
Additional Benchmark Comparisons
Relative Residual
...and 4 more sections

Key Result

Theorem 1

Assume that $\epsilon(X)$ has conditional mean zero, $\mathbb{E}(\epsilon(X)\mid X)=0$, and is unimodal at $0$, and that $\epsilon(X)$ and $\epsilon^{\ast}(X)$ are independent. If $|\epsilon(X_i')+\epsilon^{\ast}(X_{i}')|$ stochastically dominates $|\epsilon(X_{i})|$ in the third order, then $\mathbb{P}( S_r(X_{i},Y_{i})<S_r(X_{i}',Y_{i}') )\geq 1/2$.
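
A quick Monte Carlo check can illustrate the flavor of this statement under strong simplifying assumptions. The sketch below assumes the regression function is known exactly, so that the score of a point reduces to the absolute value of its noise; the definition of $S_r$ is not given in this excerpt, so that reduction is an assumption. The Gaussian choices for $\epsilon$ and $\epsilon^{\ast}$ are also illustrative, picked so that the corrupted absolute noise is stochastically larger than the clean one.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000  # number of simulated (clean, corrupted) pairs

# Clean point: only natural noise epsilon ~ N(0, 1) (zero mean, unimodal at 0).
eps_clean = rng.normal(0.0, 1.0, m)

# Corrupted point: natural noise plus an independent error epsilon* ~ N(2, 1).
eps_corrupt = rng.normal(0.0, 1.0, m)
eps_star = rng.normal(2.0, 1.0, m)

# Assumed stand-in for the score: absolute deviation from the true regression value.
score_clean = np.abs(eps_clean)
score_corrupt = np.abs(eps_corrupt + eps_star)

# Empirical estimate of P(score of clean point < score of corrupted point).
prob = np.mean(score_clean < score_corrupt)
print(f"estimated probability: {prob:.3f}")
```

In this toy setting the estimated probability should sit comfortably above 1/2, consistent with the inequality in the theorem; it says nothing about settings where the regression function must itself be estimated.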

Figures (2)

Figure 1: Left panel: Synthetic data with non-uniform epistemic and aleatoric uncertainties. 10% of the data points are set to be erroneous with a mean shift of 2, indicated in red. Right panel: Estimated $\hat{u}(x)$ and $\hat{\sigma}(x)$, which quantify the epistemic and aleatoric uncertainties, respectively.
Figure 2: Flowchart of our proposed algorithm.

Theorems & Definitions (4)

Theorem 1
Corollary 2
Corollary 3
Theorem 4
