CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration

Rachneet Sachdeva, Martin Tutek, Iryna Gurevych

TL;DR

This work tackles the fragility of small language models (SLMs) in extractive QA under distribution shift by augmenting their training data with counterfactuals (CFs) generated by a range of large language models (LLMs). It introduces two generation modes, Solo-QAG and Duo-QAG, compares them against the Retrieve-Generate-Filter (RGF) baseline, and shows that diverse, high-quality CFs consistently improve out-of-domain (OOD) performance and model calibration. The authors further enhance calibrators with dense semantic features derived from explanation tokens and investigate how explanation properties (comprehensiveness and sufficiency) relate to calibration, finding that stronger sufficiency correlates with better calibration. Overall, the approach yields robust OOD improvements and more reliable confidence estimates, with practical implications for deploying smaller QA models across diverse real-world domains.

Abstract

In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly at generating text conditioned on a prompt. In our work, we investigate the use of LLMs to augment training data of small language models (SLMs) with automatically generated counterfactual (CF) instances, i.e., minimally altered inputs, in order to improve out-of-domain (OOD) performance of SLMs in the extractive question answering (QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.
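
The last claim of the abstract relates calibration to the entropy of importance scores. The abstract does not say how that entropy is computed, so the following is only a minimal sketch under one assumption: absolute token attributions are normalized into a probability distribution and their Shannon entropy is taken. The function name `importance_entropy` and the example scores are hypothetical.

```python
import math


def importance_entropy(scores):
    """Shannon entropy (in nats) of token importance scores.

    Assumption (not from the paper): absolute scores are normalized
    to sum to 1 before the entropy is computed.
    """
    weights = [abs(s) for s in scores]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one importance score must be non-zero")
    probs = [w / total for w in weights]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical attributions over four tokens.
concise = [0.90, 0.05, 0.03, 0.02]   # importance concentrated on one token
diffuse = [0.25, 0.25, 0.25, 0.25]   # importance spread evenly

print(importance_entropy(concise) < importance_entropy(diffuse))
```

Under this reading, a lower value means the explanation concentrates importance on fewer tokens, which is what "concise" refers to in the abstract.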

Paper Structure (45 sections, 3 equations, 6 figures, 12 tables)

Figures (6)

  • Figure 1: An illustration of the counterfactual samples (purple) for the input question (green) produced by the RGF baseline and our approaches using LLMs. While RGF produces a question closely related to the input, LLMs generate more diverse questions with respect to surface form and semantic content.
  • Figure 2: Our proposed methodology for generating counterfactual instances. The Solo-QAG approach (left) generates counterfactual QA pairs in a single pass while the Duo-QAG approach (right) first generates the question, and then the answer.
  • Figure 3: Our proposed calibration methodology. The dense representations of the highly important input tokens from the CF-augmented model are condensed and converted to semantic features to train a classifier that predicts if the model prediction is correct.
  • Figure 4: Quantitative evaluation of fluency and correctness of the CF instances generated by the RGF, LLaMA, GPT-NeoxT, and Flan-UL2 models.
  • Figure 5: Percentage improvement of CF augmented models' calibration performance over the unaugmented RoBERTa-base model trained on SQuAD, using features based on probability (conf) and rationales from shap, scaled attention and integrated gradients. The results for conf (row #1) are reported on models which do not use explanation-based features. In the remaining experiments (other rows), along with base and rgf, we report the results of dense-feature augmented calibrators. We provide the complete results with other datasets and explanation methods in the appendix (app:model_calib).
  • ...and 1 more figure
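
The feature construction described in the Figure 3 caption can be sketched roughly as follows. This is an illustration, not the authors' implementation: it assumes that "highly important" means the top-k tokens by absolute attribution and that "condensed" means mean pooling. The function `pool_important_tokens`, the parameter `k`, and the toy vectors are all hypothetical.

```python
def pool_important_tokens(hidden_states, importance, k=2):
    """Mean-pool the dense vectors of the k most important tokens.

    hidden_states: one vector (list of floats) per input token
    importance:    one attribution score per input token
    Assumptions (not from the paper): top-k selection by absolute
    score, followed by mean pooling.
    """
    if not hidden_states or len(hidden_states) != len(importance):
        raise ValueError("need one non-empty vector and one score per token")
    k = min(k, len(hidden_states))
    # Indices of the k tokens with the largest absolute importance.
    top = sorted(range(len(importance)),
                 key=lambda i: abs(importance[i]),
                 reverse=True)[:k]
    dim = len(hidden_states[0])
    return [sum(hidden_states[i][d] for i in top) / k for d in range(dim)]


# Toy example: three tokens with 2-dimensional representations.
hidden = [[1.0, 0.0], [0.0, 1.0], [4.0, 2.0]]
scores = [0.1, 0.7, 0.9]
features = pool_important_tokens(hidden, scores, k=2)
print(features)
```

The pooled vector would then serve as one input feature block for a binary classifier that predicts whether the QA model's answer is correct; the choice of classifier is left open here.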