RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting

Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

TL;DR

RAZOR tackles dataset-induced shortcuts in NLP with an unsupervised text-rewriting pipeline that iteratively neutralizes surface-level biases. It formalizes shortcuts in terms of token importance and positional cues, then identifies high-risk samples using a TF-IDF–positional feature map and a shortcut score that compares feature statistics across label groups. Selected sentences are rewritten with LLM-generated variants that preserve the original label: a secondary LLM validates label consistency, and rewrites are chosen to reduce the KL divergence between the surface-feature distributions of different classes. Empirical results on FEVER, FEVER-Adversarial, MNLI, and SNLI show that RAZOR outperforms prior debiasing methods, reduces bias-related terms such as negations, and remains robust across datasets, even for smaller models. Overall, the work advocates data-centric debiasing as a powerful lever for improving the reliability and fairness of language models, with practical impact on fact-checking and natural language inference.
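The sample-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear positional decay, the "highest minus lowest label-group mean" shortcut score, the risk definition, and the helper names tfidf_positional, token_shortcut_scores, and sample_risk are all hypothetical choices made for this example. The smoothed IDF uses the same formula as scikit-learn's TfidfVectorizer with smooth_idf=True.

```python
import math
from collections import Counter, defaultdict


def tfidf_positional(docs):
    """Map each tokenized document to {token: tf-idf x positional weight}."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    feats = []
    for doc in docs:
        m = len(doc)
        tf = Counter(doc)
        f = {}
        for pos, tok in enumerate(doc):
            idf = math.log((1 + n) / (1 + df[tok])) + 1.0  # smoothed IDF
            w_pos = (m - pos) / m  # assumed decay: 1.0 at the first token, 1/m at the last
            score = (tf[tok] / m) * idf * w_pos
            f[tok] = max(f.get(tok, 0.0), score)  # keep the strongest occurrence
        feats.append(f)
    return feats


def token_shortcut_scores(feats, labels):
    """Gap between the highest and lowest label-group mean of each token's feature."""
    label_set = sorted(set(labels))
    group_sizes = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for f, y in zip(feats, labels):
        for tok, v in f.items():
            sums[tok][y] += v
    scores = {}
    for tok, by_label in sums.items():
        means = [by_label[y] / group_sizes[y] for y in label_set]  # absent label -> 0.0
        scores[tok] = max(means) - min(means)
    return scores


def sample_risk(feats, scores):
    """Risk of a sample = its largest feature-weighted token shortcut score."""
    return [max((v * scores[t] for t, v in f.items()), default=0.0) for f in feats]


# Toy claims in which the negation "not" occurs only in refutes samples.
docs = [s.split() for s in ["the sun is not hot", "snow is not cold",
                            "the sun is hot", "snow is cold"]]
labels = ["refutes", "refutes", "supports", "supports"]
feats = tfidf_positional(docs)
scores = token_shortcut_scores(feats, labels)
risk = sample_risk(feats, scores)
top = max(scores, key=scores.get)  # token with the largest cross-label gap
```

On this toy data, "not" receives by far the largest shortcut score (about 0.15, versus about 0.06 for the next token), mirroring the negation bias in FEVER noted in Figure 2. Samples with the highest risk would then be the candidates passed on for rewriting.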

Abstract

Despite the widespread use of LLMs due to their superior performance across many tasks, their high computational costs often lead potential users to opt for the pretraining-finetuning pipeline instead. However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating so-called shortcuts that hinder the generalizability of fine-tuned models. Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is difficult to acquire. We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, data-focused debiasing approach that mitigates shortcuts through text rewriting. RAZOR leverages LLMs to iteratively rewrite potentially biased text segments, replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information. This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns. Compared with unsupervised SoTA models, RAZOR improves the F1 score by 3.5% on FEVER and by 6.5% on MNLI and SNLI. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by a factor of two without requiring prior bias information, a result on par with SoTA models that leverage such information. Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness. This research also contributes to more robust evaluation benchmarks for debiasing methods by incorporating metrics for both bias reduction and overall model efficacy.
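The iterative rewrite-and-select loop in the abstract can be sketched as a single step that keeps whichever label-preserving rewrite best aligns the classes' surface features. Everything here is a hypothetical stand-in under stated assumptions: surface features are approximated by add-one-smoothed unigram distributions per class, the objective is a symmetric KL divergence between exactly two classes, and the candidates list and keeps_label function stand in for the LLM rewriter and the secondary-LLM label check, whose details are not given in this summary.

```python
import math
from collections import Counter


def class_distributions(docs, labels, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution of tokens for each class."""
    dists = {}
    for y in set(labels):
        counts = Counter(t for d, l in zip(docs, labels) if l == y for t in d)
        total = sum(counts.values()) + alpha * len(vocab)
        dists[y] = {t: (counts[t] + alpha) / total for t in vocab}
    return dists


def kl(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def class_gap(docs, labels):
    """Symmetric KL between the surface distributions of two classes."""
    vocab = {t for d in docs for t in d}
    a, b = sorted(set(labels))  # assumes exactly two classes
    dists = class_distributions(docs, labels, vocab)
    return kl(dists[a], dists[b]) + kl(dists[b], dists[a])


def rewrite_step(docs, labels, idx, candidates, keeps_label):
    """Swap docs[idx] for the label-preserving candidate that most reduces class_gap."""
    best, best_gap = docs[idx], class_gap(docs, labels)
    for cand in candidates:
        if not keeps_label(cand, labels[idx]):  # stand-in for the secondary-LLM check
            continue
        trial = docs[:idx] + [cand] + docs[idx + 1:]
        gap = class_gap(trial, labels)
        if gap < best_gap:
            best, best_gap = cand, gap
    return docs[:idx] + [best] + docs[idx + 1:], best_gap


def keeps_label(cand, label):
    """Hypothetical placeholder: accept every candidate."""
    return True


# Toy data; in practice the candidates would come from an LLM rewriter.
labels = ["refutes", "supports", "refutes", "supports"]
docs = [s.split() for s in ["paris is not in france", "paris is in france",
                            "the sun is not hot", "the sun is hot"]]
candidates = [s.split() for s in ["paris is in germany", "paris is not french"]]
new_docs, new_gap = rewrite_step(docs, labels, 0, candidates, keeps_label)
```

Symmetric KL is used because plain KL is asymmetric, while the two classes play interchangeable roles here; the paper's actual objective and feature space may differ. Because the original sentence is kept unless a candidate lowers the gap, a step never increases the measured divergence. Repeating the step over the highest-risk samples gives the iterative loop.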

Paper Structure

This paper contains 24 sections, 1 theorem, 20 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

Given a document of tokens $d_i = \{t_1, t_2, \dots, t_m\} \in \mathcal{D}$ and an attention-score attribution function $h: \mathcal{X} \rightarrow \mathbb{R}^\ell$, if $\hat{d}_i \subset d_i$ is a shortcut, then where

Figures (7)

  • Figure 1: Example of spurious correlation in sentiment classification tasks, where a classifier $f_\theta$ takes Spielberg and New York Subway as shortcuts and makes wrong predictions w.r.t. the ground truth ($\Phi$). The classifier concentrates on the bold tokens to make the prediction; however, the underlined tokens might be more useful in producing the correct label.
  • Figure 2: An example of RAZOR's application in the fact-checking task. The task is to determine whether a piece of evidence from Wikipedia supports or refutes a claim. The instance here is sampled from the FEVER dataset, where the negation word "not" has been reported to correlate spuriously with the class refutes.
  • Figure 3: Effect of shortcut-related terms on BERT and RoBERTa, with and without RAZOR, using 500 randomly sampled original-rewritten pairs from the FEVER dataset and testing on the FEVER-Adversarial set.
  • Figure 4: The performance of models trained on FEVER when evaluated on the original FEVER test set.
  • Figure 5: The performance of models trained on 500 samples randomly selected from FEVER when evaluated on the FEVER-Adversarial test set.
  • ...and 2 more figures
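Figures 2 and 3 concern terms such as "not" that co-occur disproportionately with one label. One simple way to quantify such a bias before and after rewriting is the per-label share of claims that contain a negation. The sketch below assumes a small hypothetical negation lexicon and toy, hand-written rewrites; it is not necessarily the metric used in the paper.

```python
# Hypothetical negation lexicon; the paper's set of bias-related terms may differ.
NEGATIONS = {"not", "no", "never", "nobody", "nothing", "n't"}


def negation_rate(docs, labels):
    """Fraction of documents in each label group that contain a negation token."""
    rates = {}
    for y in sorted(set(labels)):
        group = [d for d, l in zip(docs, labels) if l == y]
        hits = sum(any(t in NEGATIONS for t in d) for d in group)
        rates[y] = hits / len(group)
    return rates


labels = ["refutes", "refutes", "supports", "supports"]
original = [s.split() for s in ["paris is not in france", "the sun is not hot",
                                "paris is in france", "the sun is hot"]]
# Toy label-preserving rewrites: one negation removed from a refutes claim,
# one negation added to a supports claim.
rewritten = [s.split() for s in ["paris is in germany", "the sun is not hot",
                                 "paris is in france", "the sun is not cold"]]

before = negation_rate(original, labels)   # negation appears only under refutes
after = negation_rate(rewritten, labels)   # negation is now spread evenly
```

Here the negation rate moves from 1.0 (refutes) versus 0.0 (supports) to 0.5 for both labels, so "not" no longer predicts the label on its own.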

Theorems & Definitions (2)

  • Definition 1
  • Lemma 1