Counterfactual Simulatability of LLM Explanations for Generation Tasks

Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown

TL;DR

This paper introduces counterfactual simulatability as a framework for evaluating how well LLM explanations for generation tasks enable users to predict model outputs on related counterfactual inputs. It formalizes the problem with notions of simulatability, generality, and precision, and proposes an explanation-decomposition pipeline in which an explainer LLM breaks each explanation into atomic units and an annotator assesses each unit. The framework is applied to two generation tasks (CNN/DM news summarization and medical suggestion generation), revealing that explanations support users' mental models and output inference in summarization but are less effective for knowledge-based medical suggestion generation. Across both tasks, the authors show that automated annotation via GPT-4 Turbo can approximate human agreement, suggesting a scalable path to evaluating explanations. The results highlight task-dependent limits of current explanations and indicate that counterfactual simulatability may be more suitable for skill-based generation than for knowledge-based generation, informing future directions for robust explanations in high-stakes settings.

Abstract

LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability: how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has previously been studied for yes/no question-answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that evaluating counterfactual simulatability may be more appropriate for skill-based tasks than for knowledge-based tasks.

Paper Structure

This paper contains 40 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Our evaluation pipeline. Given a model's explanation, an LLM is prompted to decompose the explanation into atomic units (left) and generate relevant counterfactuals (right). For each unit, an annotator verifies whether the element appears in the counterfactual (simulatability) and the counterfactual output (precision).
  • Figure 2: Example explanations, counterfactuals, counterfactual outputs, and annotations for news summarization and medical suggestion. Atomic units of the explanation are highlighted (for medical suggestion, blue: patient information, orange: suggestions).
  • Figure 3: Screenshot of the news summarization annotation interface.
  • Figure 4: Screenshots of the medical suggestion annotation interface.
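
As a rough illustration of the per-unit annotation step described in the Figure 1 caption, the Python sketch below aggregates annotator judgments into two fractions. The `UnitAnnotation` record, the `summarize_annotations` helper, and the aggregation itself are hypothetical and exist only for this example; the paper's own definitions of simulatability, generality, and precision are given by its equations, which are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UnitAnnotation:
    """One annotator judgment for one atomic unit of an explanation (hypothetical schema)."""
    unit: str                # text of the atomic unit
    in_counterfactual: bool  # does the unit appear in the counterfactual input?
    in_output: bool          # does the unit appear in the model's counterfactual output?


def summarize_annotations(annotations: List[UnitAnnotation]) -> Dict[str, Optional[float]]:
    """Aggregate per-unit judgments into two fractions.

    Assumption for illustration only: a unit counts as "simulatable" when it
    appears in the counterfactual, and as "precise" when it also appears in
    the counterfactual output. This is not the paper's formal metric.
    """
    if not annotations:
        return {"simulatable_fraction": None, "precise_fraction": None}

    applicable = [a for a in annotations if a.in_counterfactual]
    simulatable_fraction = len(applicable) / len(annotations)

    # Precision is only defined over units that apply to the counterfactual.
    precise_fraction = None
    if applicable:
        precise_fraction = sum(a.in_output for a in applicable) / len(applicable)

    return {
        "simulatable_fraction": simulatable_fraction,
        "precise_fraction": precise_fraction,
    }


if __name__ == "__main__":
    # Made-up annotations, not data from the paper.
    example = [
        UnitAnnotation("mentions the main event", True, True),
        UnitAnnotation("includes key figures", True, False),
        UnitAnnotation("names the location", False, False),
    ]
    print(summarize_annotations(example))
```

Returning `None` rather than `0.0` when no unit applies keeps "no evidence" distinct from "every applicable unit failed", which matters if scores are later averaged across counterfactuals.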