No More Distractions: an Adaptive Up-Sampling Algorithm to Reduce Data Artifacts

Han Chen

TL;DR

Spurious token-label correlations can cause NLP models to perform well on benchmarks without truly understanding semantics. The paper analyzes SNLI data to quantify token-label bias using metrics like $p^{*}$ and $z^{*}$, identifies top biased tokens, and demonstrates that models disproportionately succeed on records containing majority-label tokens. It then introduces AUDAC, an adaptive upsampling method that iteratively balances $p(label|token)$ toward a uniform distribution across labels for the top biased tokens, without requiring human edits. In experiments, applying AUDAC yields a modest but meaningful improvement in overall accuracy (from 89.149% to 89.667%) and increases performance on corrected-token subsets, while substantially reducing artifact indicators, highlighting a practical approach to mitigate data artifacts and improve generalization in NLP.
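To make the token-label bias measurement concrete, here is a minimal sketch. This section does not give the paper's exact definitions of $p^{*}$ and $z^{*}$, so the sketch makes assumptions: the first statistic is taken to be the empirical share of a token's majority label, and the second a one-proportion z-statistic against a uniform null $p_0 = 1/\text{num\_labels}$. The function name `token_label_stats` and the `(tokens, label)` record layout are hypothetical.

```python
import math
from collections import Counter, defaultdict


def token_label_stats(records, num_labels):
    """Return {token: (majority_label, p_hat, z, n)} for (tokens, label) records.

    Assumption: p_hat is the empirical p(majority label | token) and z is a
    one-proportion z-statistic against a uniform null p0 = 1 / num_labels.
    These may differ from the paper's exact definitions.
    """
    counts = defaultdict(Counter)
    for tokens, label in records:
        # Count each token at most once per record.
        for tok in set(tokens):
            counts[tok][label] += 1

    p0 = 1.0 / num_labels
    stats = {}
    for tok, label_counts in counts.items():
        n = sum(label_counts.values())              # records containing the token
        label, k = label_counts.most_common(1)[0]   # majority label and its count
        p_hat = k / n
        z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
        stats[tok] = (label, p_hat, z, n)
    return stats
```

Sorting the result by `z` in descending order gives one plausible way to pick the "top biased tokens" that the correction step then targets.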

Abstract

Researchers have recently found that language models sometimes achieve high accuracy on a benchmark data set yet fail to generalize when even small changes are made to it. One cause is data artifacts: the model learns spurious correlations between tokens and labels instead of the semantics and logic. In this work, we analyze the SNLI data and visualize such spurious correlations. We propose an adaptive up-sampling algorithm to correct the data artifacts; it is simple and effective and requires no human edits or annotation. In an experiment applying the algorithm to SNLI, the model trained on the corrected data performed significantly better than the model trained on the raw SNLI data, both overall and on the subset we corrected.
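The up-sampling idea can be illustrated with a rough sketch. This is not the paper's AUDAC algorithm, whose details are not given in this section. It only assumes the behavior described in the summary: for each selected biased token, records carrying under-represented labels are duplicated until $p(label|token)$ is roughly uniform, and the pass is repeated because balancing one token can disturb another. The function name, the `max_rounds` cap, and the sampling-with-replacement choice are all hypothetical.

```python
import random


def upsample_for_tokens(records, biased_tokens, labels, max_rounds=5, seed=0):
    """Duplicate minority-label records so each biased token sees labels evenly.

    records: list of (tokens, label) pairs. Illustrative sketch only.
    """
    rng = random.Random(seed)
    data = list(records)
    for _ in range(max_rounds):
        added = 0
        for tok in biased_tokens:
            # Group the records that contain this token by label.
            by_label = {
                lab: [r for r in data if tok in r[0] and r[1] == lab]
                for lab in labels
            }
            target = max(len(pool) for pool in by_label.values())
            for pool in by_label.values():
                deficit = target - len(pool)
                # A label with no record containing the token cannot be
                # up-sampled, so it is skipped.
                if pool and deficit > 0:
                    data.extend(rng.choices(pool, k=deficit))
                    added += deficit
        if added == 0:  # every selected token is already balanced
            break
    return data
```

Because only existing records are copied, no human editing or annotation is involved, which matches the constraint stated in the abstract. The `max_rounds` cap is a safeguard against tokens whose balances keep disturbing one another.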

Paper Structure (5 sections, 2 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Data Artifact Analysis
Adaptive Up-Sampling Data Artifacts Correction Algorithm
Experiments
Conclusion

Figures (4)

Figure 1: Artifact statistics in SNLI
Figure 2: Accuracy Comparison for Biased Labels
Figure 3: Artifact statistics in corrected SNLI
Figure 4: Accuracy Comparison for Biased Labels After Correction
