Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification

Van Bach Nguyen, Christin Seifert, Jörg Schlötterer

TL;DR

This work tackles the challenge of generating high-fidelity counterfactual (CF) explanations for text classification without costly fine-tuning. It introduces two classifier-guided strategies that inject classifier information into LLM-based CF generation: Classifier-Guided Generation (CGG) and Classifier-Guided Validation (CGV). CGG uses XAI-derived word importance to steer generation, while CGV generates multiple candidates and selects the best by classifier fidelity and minimal edits. Evaluations on the CEVAL benchmark (IMDB and SNLI) show that these methods often outperform state-of-the-art CF approaches across several metrics and can improve classifier robustness through data augmentation. A key finding is that LLMs rely partly on parametric knowledge rather than faithfully following the classifier, underscoring the need for faithful evaluation of CF explanations across high- and low-accuracy classifiers. Overall, the paper demonstrates that modest, classifier-informed prompting can produce high-quality CFs at scale, with practical implications for interpretability and robustness.
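The two strategies summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names (`build_cgg_prompt`, `select_counterfactual`), the keyword-based `toy_classifier`, and the hand-written candidate list are all assumptions made for illustration. In practice the important words would come from an XAI feature-importance method applied to the real classifier, and the candidates would come from an LLM.

```python
from typing import Callable, List, Optional


def build_cgg_prompt(text: str, target_label: str, important_words: List[str]) -> str:
    """CGG idea: extend the prompt with words the classifier relied on.

    The wording here is hypothetical, not the prompt used in the paper.
    """
    hints = ", ".join(important_words)
    return (
        f"Change the following text so that it is classified as '{target_label}'. "
        "Make as few edits as possible. "
        f"The classifier relied most on these words: {hints}.\n"
        f"Text: {text}"
    )


def token_edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance computed over tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_counterfactual(
    original: str,
    candidates: List[str],
    classify: Callable[[str], str],
    target_label: str,
) -> Optional[str]:
    """CGV idea: keep candidates the classifier assigns to the target label,
    then return the one with the fewest token edits (None if none flip)."""
    orig_tokens = original.split()
    valid = [c for c in candidates if classify(c) == target_label]
    if not valid:
        return None
    return min(valid, key=lambda c: token_edit_distance(orig_tokens, c.split()))


# Toy stand-in for a trained sentiment classifier (assumption for the demo).
NEGATIVE_WORDS = {"terrible", "boring"}


def toy_classifier(text: str) -> str:
    tokens = set(text.lower().split())
    return "negative" if tokens & NEGATIVE_WORDS else "positive"


original = "The movie was great and the cast was wonderful"
prompt = build_cgg_prompt(original, "negative", ["great", "wonderful"])

# Hand-written stand-ins for LLM-generated candidates.
candidates = [
    "The movie was great and the cast was fine",          # label does not flip
    "The movie was terrible and the cast was wonderful",  # flips with one edit
    "A boring film with a terrible cast",                 # flips with many edits
]
best = select_counterfactual(original, candidates, toy_classifier, "negative")
print(best)
```

In this sketch the selection step filters on the classifier's prediction first and uses edit distance only as a tie-breaker among label-flipping candidates, which mirrors the stated priority of fidelity over minimality.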

Abstract

The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model's prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier's robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Counterfactual explanations to explain a text classifier. Counterfactuals (CFs) are minimal changes to the input, such that the classifier assigns a different label. We propose and evaluate two approaches to generate CFs by LLMs that use information from the classifier (besides the original input and label) to guide the generation process ante-hoc (CGG) or post-hoc (CGV).
  • Figure 2: Overview of two classifier-guided approaches for generating counterfactual explanations. a) Classifier-Guided Generation (CGG) uses an XAI feature importance method to identify the words most relevant to the prediction, which are used to extend the prompt for generating a counterfactual example with an LLM. b) Classifier-Guided Validation (CGV) is a post-hoc method that selects the best counterfactual example from a set of (unguided) generated candidates.
  • Figure 3: Impact of CGG on modification rate (MR) and flip rate (FR) on the SNLI dataset. The first panel illustrates the MR for each LLM, with and without CGG. The second panel compares the MR for flipped (Success) and non-flipped (Fail) instances, alongside the FR for cases where MR $> 0.5$ and MR $\leq 0.5$.
  • Figure 4: Prompt for CF generation with CGG on IMDB
  • Figure 5: Prompt for CF generation with CGG on SNLI
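
The Figure 3 caption refers to modification rate (MR) and flip rate (FR) without defining them in this summary. The sketch below shows one plausible way to compute both, under stated assumptions: MR is taken to be the token-level edit distance between original and counterfactual, normalized by the original length, and FR is the fraction of counterfactuals whose predicted label differs from the original prediction. The function names are hypothetical, and the paper's exact definitions may differ.

```python
from typing import List, Sequence


def token_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance computed over tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def modification_rate(original: str, counterfactual: str) -> float:
    """Assumed MR: token edits divided by the number of original tokens."""
    orig_tokens = original.split()
    cf_tokens = counterfactual.split()
    return token_edit_distance(orig_tokens, cf_tokens) / max(len(orig_tokens), 1)


def flip_rate(original_labels: List[str], cf_labels: List[str]) -> float:
    """Assumed FR: share of counterfactuals whose predicted label changed."""
    if len(original_labels) != len(cf_labels):
        raise ValueError("label lists must have the same length")
    if not original_labels:
        return 0.0
    flipped = sum(o != c for o, c in zip(original_labels, cf_labels))
    return flipped / len(original_labels)


mr = modification_rate("a b c d", "a b x d")  # one of four tokens replaced
fr = flip_rate(
    ["positive", "negative", "positive", "positive"],
    ["negative", "negative", "negative", "positive"],
)  # two of four predictions changed
print(mr, fr)
```

With definitions like these, the split at MR = 0.5 in Figure 3 would separate counterfactuals that rewrite more than half of the original tokens from those that make lighter edits.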