Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling

Christopher Román Jaimes

TL;DR

This work tackles the pervasive issue of spurious correlations in NLI by introducing a three-component debiasing pipeline: LF-LMI for artifact detection, an LLM-driven synthesis process to create high-quality counterfactuals, and Dynamic Balanced Sampling to train models without catastrophic forgetting. The authors show that standard models rely on artifacts, evidenced by strong hypothesis-only performance and semantic biases such as negation and generalization cues. Their end-to-end pipeline yields a substantial boost in consistency on a contrast-set benchmark (63.5% to 81.0%) while preserving in-domain accuracy (around 88.4%), demonstrating robust reasoning improvements without sacrificing generalization. The approach is scalable and reproducible, reducing reliance on expensive human annotation and offering a practical blueprint for debiasing NLI datasets and models.

Abstract

Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.

Paper Structure

This paper contains 35 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Overview of our Automated Debiasing Pipeline. (1) Artifacts are identified via LF-LMI. (2) LLMs synthesize counterfactuals, filtered by strict consensus. (3) The model is trained using Dynamic Balanced Sampling: the static contrast set is mixed with a rotating random subset of the original data in each epoch to prevent catastrophic forgetting.
  • Figure 2: Impact of fine-tuning dataset size on model performance. (a) Generalization on the original SNLI validation set: Naive Finetuning (red) suffers from catastrophic forgetting as data scales, whereas our Dynamic Balanced strategy (blue) maintains performance near the Base Model. (b) Robustness on the synthetic contrast set: both methods achieve similar gains in consistency, demonstrating that our strategy prevents forgetting without compromising the learning of new boundaries.
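
The per-epoch mixing described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dynamic_balanced_epochs`, the `original_ratio` parameter, and the choice to draw a subset the same size as the contrast set are assumptions made for illustration, since the summary does not state the subset size or the sampling scheme.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def dynamic_balanced_epochs(
    contrast_set: Sequence[T],
    original_data: Sequence[T],
    num_epochs: int,
    original_ratio: float = 1.0,  # assumption: original subset size relative to the contrast set
    seed: int = 0,
) -> Iterator[List[T]]:
    """Yield one shuffled training list per epoch.

    Every epoch contains the full (static) contrast set plus a freshly
    drawn random subset of the original data, so the original examples
    rotate from epoch to epoch.
    """
    rng = random.Random(seed)
    # Cap the subset size at the amount of original data available.
    k = min(len(original_data), int(round(original_ratio * len(contrast_set))))
    for _ in range(num_epochs):
        # Sample without replacement within an epoch; a new draw each epoch.
        rotating_subset = rng.sample(list(original_data), k)
        epoch_examples = list(contrast_set) + rotating_subset
        rng.shuffle(epoch_examples)
        yield epoch_examples


if __name__ == "__main__":
    # Toy stand-ins for the contrast set and the original training data.
    contrast = [("contrast", i) for i in range(4)]
    original = [("original", i) for i in range(100)]
    for epoch_idx, examples in enumerate(dynamic_balanced_epochs(contrast, original, 3)):
        n_original = sum(1 for source, _ in examples if source == "original")
        print(f"epoch {epoch_idx}: {len(examples)} examples, {n_original} from original data")
```

The design intent is that the contrast examples are seen every epoch while the model is repeatedly exposed to different slices of the original distribution, which is the mechanism the caption credits with preventing catastrophic forgetting. Naive fine-tuning, by contrast, would correspond to training on the contrast set alone.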