Towards LLM-guided Causal Explainability for Black-box Text Classifiers

Amrita Bhattacharjee; Raha Moraffah; Joshua Garland; Huan Liu

TL;DR

This work introduces a three-step pipeline that uses instruction-tuned LLMs to provide causal explainability for black-box text classifiers: it first identifies latent (unobserved) features in the input, then links them to specific input tokens, and finally generates minimal counterfactual edits that flip the classifier's prediction. Evaluating on IMDB, AG News, and SNLI with multiple LLMs, the study finds that GPT-4 performs best at producing effective and plausible counterfactual explanations, though there is a trade-off between flipping accuracy and content preservation. The findings suggest that latent feature extraction improves explanation quality and that a fully LLM-driven, three-step process can produce high-quality, causally informative counterfactuals, with potential applications in causal explainability and NLP reasoning more broadly. The work has practical implications for auditing and interpreting black-box NLP models in safety- and impact-sensitive applications.

Abstract

With the advent of larger and more complex deep learning models, such as those in Natural Language Processing (NLP), model qualities like explainability and interpretability, albeit highly desirable, are becoming harder to achieve. For example, state-of-the-art models in text classification are black-box by design. Although standard explanation methods provide some degree of explainability, they are mostly correlation-based and do not provide much insight into the model. The alternative of causal explainability is more desirable but extremely challenging to achieve in NLP for a variety of reasons. Inspired by recent endeavors to utilize Large Language Models (LLMs) as experts, in this work we aim to leverage the instruction-following and textual understanding capabilities of recent state-of-the-art LLMs to facilitate causal explainability via counterfactual explanation generation for black-box text classifiers. To do this, we propose a three-step pipeline in which we use an off-the-shelf LLM to: (1) identify the latent or unobserved features in the input text, (2) identify the input features associated with the latent features, and finally (3) use the identified input features to generate a counterfactual explanation. We experiment with our pipeline on multiple NLP text classification datasets, with several recent LLMs, and present interesting and promising findings.
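
The three steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` and `Classifier` callables, the prompt wording, the comma-separated output format, and the helper names `parse_list` and `generate_counterfactual` are all hypothetical. Any instruction-following LLM client and any black-box classifier could be wrapped to fit these interfaces.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
LLM = Callable[[str], str]         # prompt -> completion
Classifier = Callable[[str], str]  # text -> predicted label (black box)


def parse_list(completion: str) -> List[str]:
    """Split a comma-separated completion into stripped, non-empty items."""
    return [item.strip() for item in completion.split(",") if item.strip()]


def generate_counterfactual(text: str, llm: LLM, classifier: Classifier) -> Dict[str, object]:
    """Run the three-step pipeline once and report whether the label flipped."""
    original_label = classifier(text)

    # Step 1: ask the LLM for latent (unobserved) features of the input.
    latent = parse_list(llm(
        "Step 1: List, comma-separated, the latent features of this text.\n"
        f"Text: {text}"
    ))

    # Step 2: ask which input words are associated with those latent features.
    words = parse_list(llm(
        "Step 2: List, comma-separated, the words in the text associated "
        f"with these latent features: {', '.join(latent)}.\nText: {text}"
    ))

    # Step 3: ask for a minimal edit of those words that changes the label.
    counterfactual = llm(
        "Step 3: Minimally edit the text, changing only these words: "
        f"{', '.join(words)}, so that its label is no longer "
        f"'{original_label}'.\nText: {text}"
    ).strip()

    # Query the black-box classifier again to check whether the edit worked.
    new_label = classifier(counterfactual)
    return {
        "original_label": original_label,
        "latent_features": latent,
        "input_features": words,
        "counterfactual": counterfactual,
        "new_label": new_label,
        "flipped": new_label != original_label,
    }
```

In this sketch the classifier is only ever queried for its predicted label, which matches the black-box setting: no gradients or internal states are needed. A real prompt would likely need more careful output-format instructions than the comma-separated convention assumed here.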
