Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?

Peter Hase, Shiyue Zhang, Harry Xie, Mohit Bansal

TL;DR

This work introduces leakage-adjusted simulatability (LAS), a metric for evaluating whether model-generated natural language (NL) explanations help an observer predict the model's output through their semantic content rather than by simply leaking that output. The metric uses a model as a proxy for a human observer, a choice validated with two human subject experiments. The authors implement four generative graphical models for NL explanation (ST-Re, ST-Ra, MT-Re, MT-Ra) on top of T5. Rationalizing explanations often yield higher LAS than reasoning ones, with ST-Ra approaching human-level LAS and no strong accuracy trade-offs observed. A multi-agent optimization framework further improves LAS in some settings. The authors also provide code for reproducing their LAS-based evaluations on the CoS-E and e-SNLI datasets.

Abstract

Data collection for natural language (NL) understanding tasks has increasingly included human explanations alongside data points, allowing past works to introduce models that both perform a task and generate NL explanations for their outputs. Yet to date, model-generated explanations have been evaluated on the basis of surface-level similarities to human explanations, both through automatic metrics like BLEU and human evaluations. We argue that these evaluations are insufficient, since they fail to indicate whether explanations support actual model behavior (faithfulness), rather than simply match what a human would say (plausibility). In this work, we address the problem of evaluating explanations from the model simulatability perspective. Our contributions are as follows: (1) We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations, which measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We use a model as a proxy for a human observer, and validate this choice with two human subject experiments. (2) Using the CoS-E and e-SNLI datasets, we evaluate two existing generative graphical models and two new approaches; one rationalizing method we introduce achieves roughly human-level LAS scores. (3) Lastly, we frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage, which can improve LAS scores. We provide code for the experiments in this paper at https://github.com/peterbhase/LAS-NL-Explanations
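
The abstract describes LAS only in words, so the following is a minimal sketch of how such a leakage-controlled score could be computed, not the authors' implementation. It assumes the definition given in the paper's metric section, which is not part of this excerpt: examples are split by whether the explanation alone lets the simulator recover the model output (leaking vs. non-leaking), the per-example effect of the explanation is the simulator's correctness with the explanation minus its correctness without it, and the score is the unweighted average of the two bin means. The function name `las_score`, its argument names, and the handling of an empty bin are illustrative choices.

```python
from statistics import mean


def las_score(correct_with_expl, correct_without_expl, leaked):
    """Leakage-adjusted simulatability, sketched under stated assumptions.

    correct_with_expl[i]    -- 1 if the simulator predicts the task model's
                               output from (input, explanation), else 0
    correct_without_expl[i] -- 1 if it predicts the output from the input alone
    leaked[i]               -- 1 if it predicts the output from the
                               explanation alone (the explanation "leaks")
    """
    if not (len(correct_with_expl) == len(correct_without_expl) == len(leaked)):
        raise ValueError("all three sequences must have the same length")

    bin_means = []
    for leak_flag in (False, True):
        # Per-example effect of the explanation within this leakage bin.
        effects = [
            int(with_e) - int(without_e)
            for with_e, without_e, is_leak in zip(
                correct_with_expl, correct_without_expl, leaked
            )
            if bool(is_leak) == leak_flag
        ]
        # Assumption of this sketch: an empty bin is skipped, not counted as 0.
        if effects:
            bin_means.append(mean(effects))

    if not bin_means:
        raise ValueError("no examples provided")
    # Unweighted average over bins, so leaking examples cannot dominate.
    return mean(bin_means)


# Toy data: the first two explanations leak the output, the last four do not.
with_e = [1, 1, 1, 0, 1, 1]
without_e = [1, 0, 0, 0, 1, 1]
leaked = [1, 1, 0, 0, 0, 0]
toy_las = las_score(with_e, without_e, leaked)
```

Averaging the bin means rather than all examples is what separates this from a raw simulatability difference: on the toy data the two bins are weighted equally even though only two of the six explanations leak the output.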

Paper Structure

This paper contains 27 sections, 6 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Graphical models representing varying roles of explanations, where the task input is denoted by $x$, task output by $y$, and explanation by $e$. We introduce a new rationalizing model, ST-Ra, while also testing a reasoning multi-task model, MT-Re, and two other methods from past works (Camburu et al., 2018; Rajani et al., 2019).
  • Figure 2: Inputs and outputs for the T5 Multi-task framework. In the reasoning mode, explanations are not conditioned on the model's prediction, whereas in the rationalizing mode they are dependent on the model output.
  • Figure 3: Causal diagram of model simulation. The simulator prediction's correctness, $\mathbbm{1}[\hat{y} \mid x, \hat{e}]$, is influenced by three variables: (1) the task model input, (2) the model explanation's semantic content, $\hat{e}_z$, and (3) whether the explanation leaks the model output, $\hat{e}_{\hat{y}}$.
  • Figure 4: A simulator model predicts a task model output, given its input and a model-generated explanation.
  • Figure 5: Inputs and outputs for the sequence to sequence ST-Ra framework. One explanation is generated for each answer choice, conditioned on the choice. The sequences and answers are supplied to a sequence-to-sequence task model for scoring. We use separate T5 models for the generator and task model.
  • ...and 2 more figures