Does Using Counterfactual Help LLMs Explain Textual Importance in Classification?

Nelvin Tan, James Asikin Cheung, Yu-Ching Shih, Dong Yang, Amol Salunkhe

TL;DR

This work addresses explainability for text classification with black-box LLMs by introducing the decision-changing rate to quantify how top-$k$ words drive a decision. It develops three prompting strategies—Direct Prompting (DP), Counterfactual-Parallel (CFP), and Counterfactual-Sequential (CFS)—and extends them with sampling-based weight aggregation. Empirical results on Amazon, SST2, and IMDB using LLaMA3-70B and GPT-4o show that counterfactual approaches, particularly CFP, can improve the identification of influential words while maintaining accuracy, offering cost-efficient and auditable explanations. The study highlights practical implications for deploying explainable LLM-based classifiers and points to future work in multi-class settings and broader model coverage.

Abstract

Large language models (LLMs) are becoming useful in many domains due to their impressive abilities that arise from large training datasets and large model sizes. More recently, they have been shown to be very effective in textual classification tasks, motivating the need to explain the LLMs' decisions. Motivated by practical constraints where LLMs are black-boxed and LLM calls are expensive, we study how incorporating counterfactuals into LLM reasoning can affect the LLM's ability to identify the top words that have contributed to its classification decision. To this end, we introduce a framework called the decision-changing rate that helps us quantify the importance of the top words in classification. Our experimental results show that using counterfactuals can be helpful.
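
The abstract does not spell out how the decision-changing rate is computed, so the following is only a minimal sketch under a stated assumption: the rate is taken to be the fraction of texts whose predicted label flips once their reported top words are deleted. The names `remove_words`, `decision_changing_rate`, and `toy_classify` are hypothetical, and the keyword-based `toy_classify` merely stands in for a black-box LLM call.

```python
import re
from typing import Callable, Iterable, Sequence


def remove_words(text: str, words: Iterable[str]) -> str:
    """Delete whole-word, case-insensitive occurrences of `words` from `text`."""
    for word in words:
        text = re.sub(rf"\b{re.escape(word)}\b", "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())


def decision_changing_rate(
    classify: Callable[[str], str],
    texts: Sequence[str],
    top_words: Sequence[Sequence[str]],
) -> float:
    """Assumed definition: share of texts whose label changes after
    their top words are removed."""
    if not texts or len(texts) != len(top_words):
        raise ValueError("texts and top_words must be non-empty and aligned")
    flips = 0
    for text, words in zip(texts, top_words):
        if classify(text) != classify(remove_words(text, words)):
            flips += 1
    return flips / len(texts)


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for an LLM classifier: keyword sentiment vote."""
    positive = {"great", "love"}
    negative = {"bad", "awful"}
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative"


texts = ["I love this great phone", "The battery is awful"]
top_words = [["love", "great"], ["awful"]]
rate = decision_changing_rate(toy_classify, texts, top_words)
print(rate)  # 0.5: only the first text flips once its top words are removed
```

Under this reading, a higher rate means the words an approach singles out matter more to the classifier's decision, which is what would allow DP, CFP, and CFS to be compared on the same texts.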

Paper Structure

This paper contains 32 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Visualization of the different approaches
  • Figure 2: Visualization of the different approaches