ALMANACS: A Simulatability Benchmark for Language Model Explainability

Edmund Mills; Shiye Su; Stuart Russell; Scott Emmons

TL;DR

ALMANACS is a fully automated simulatability benchmark for language-model explainability. An explainer $\mathcal{E}$ augments a dataset of model behavior with explanations, and a predictor $\mathcal{P}$ uses that dataset to predict the model's behavior on distributionally shifted Yes/No scenarios across 12 safety-relevant topics. With GPT-4 as the automated predictor, the benchmark tests counterfactual, rationalization, attention, and Integrated Gradients explanations, scoring predictions by KL divergence (KLDiv), total variation distance (TVDist), and Spearman's $\rho$. Across two LLMs, flan-alpaca-gpt4-xl and vicuna-7b-v1.3, none of the explanation methods consistently improves simulatability on average, underscoring how hard it is to create explanations that aid behavior prediction. The study also validates the GPT-4 predictor's capabilities and finds broad alignment with human predictors, though some subtasks show divergence; ALMANACS is therefore positioned as a complement to human evaluation rather than a replacement for it. Overall, ALMANACS offers a scalable, automated framework for rapid screening of explainability methods and helps map the limits of current explanations in aiding model simulatability.

Abstract

How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. While not a replacement for human evaluations, we aim for ALMANACS to be a complementary, automated tool that allows for fast, scalable evaluation. Using ALMANACS, we evaluate counterfactual, rationalization, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.
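To make the scoring concrete, the three metrics named above (KL divergence, TV distance, and Spearman's $\rho$) can be sketched for Yes/No scenarios, where each answer reduces to a single probability of Yes. This is a minimal illustration, not the benchmark's own code: the function names, the KL direction (model relative to predictor), the clipping constant `eps`, and averaging over scenarios are all assumptions made here.

```python
import math


def bernoulli_kl(p, q, eps=1e-9):
    """KL(p || q) between two Yes/No distributions given P(Yes).

    Assumption: p is the model's probability and q the predictor's.
    Probabilities are clipped to avoid log(0).
    """
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def bernoulli_tv(p, q):
    """Total variation distance between two Yes/No distributions."""
    return abs(p - q)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def score_predictor(model_p_yes, predicted_p_yes):
    """Average KLDiv and TVDist over scenarios, plus Spearman's rho."""
    pairs = list(zip(model_p_yes, predicted_p_yes))
    n = len(pairs)
    return {
        "KLDiv": sum(bernoulli_kl(p, q) for p, q in pairs) / n,
        "TVDist": sum(bernoulli_tv(p, q) for p, q in pairs) / n,
        "Spearman": spearman_rho(model_p_yes, predicted_p_yes),
    }
```

Under these assumptions, lower KLDiv and TVDist and higher Spearman indicate a predictor that simulates the model more closely; an explanation method helps only if it improves these scores relative to the explanation-free control.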

Paper Structure (48 sections, 6 equations, 24 figures, 5 tables)


Introduction
Benchmark design
Framework for explanations
Templates and dataset generation
Evaluation metrics
Explanation methods
Naive baselines
Counterfactuals
Rationalizations
Attention
Integrated Gradients
Results
Validating the automated LLM predictor
Can the GPT-4 predictor understand explanations and apply them in new scenarios?
Do results with the GPT-4 predictor agree with results from human predictors?
...and 33 more sections

Figures (24)

Figure 1: Explainer / predictor framework in the ALMANACS Yes/No scenarios. (Top) The explainer $\mathcal{E}$ augments the model behavior dataset with explanations. Four explanation methods are depicted: counterfactuals, rationalizations, salience, and Integrated Gradients. (Bottom) The predictor $\mathcal{P}$ references the explanation-augmented dataset to predict model behavior. Its predictions are scored against model responses by KL divergence, TV distance, and Spearman's $\rho$.
Figure 2: How language models behave in ALMANACS. (Left) The total probability assigned to Yes- and No-like tokens. (Center) The average probability of Yes. (Right) How much a model's answers vary within each template, measured by the average total variation distance between scenarios drawn from the same template. We see that ALMANACS elicits idiosyncratic behavior.
Figure 3: Benchmark design. (Left) ALMANACS templates delineate Yes/No questions in which each of 5 placeholder phrases is selected from a set of 15 values. Each placeholder phrase significantly impacts the question's premise. (Right) Selecting different phrase combinations introduces a distributional shift between training and testing.
Figure 4: (a) GPT-4's prediction performance on ALMANACS for a synthetic linear model. (b) Human performance on a sample of ALMANACS topics for flan-alpaca-gpt4-xl. (c) GPT-4 performance on the same sample of questions.
Figure 5: Example template from the MoralDilemmas task. For brevity, only 4 out of 15 values per variable slot are shown.
...and 19 more figures
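
The template structure in the Figure 3 caption (5 placeholder phrases, each chosen from 15 values) can be illustrated with a short sketch. Everything specific below is hypothetical: the toy template text, the slot names, and in particular the split rule, which holds out 5 values per slot for testing. The caption only states that different phrase combinations produce the train-test shift, so the actual split procedure may differ.

```python
import random


def build_slot_values(slots, n_values=15):
    """Create placeholder values for each slot (toy stand-ins for phrases)."""
    return {slot: [f"{slot}_{i}" for i in range(n_values)] for slot in slots}


def split_slot_values(slot_values, n_held_out, seed=0):
    """Hypothetical split: reserve n_held_out values per slot for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for slot, values in slot_values.items():
        shuffled = list(values)
        rng.shuffle(shuffled)
        test[slot] = shuffled[:n_held_out]
        train[slot] = shuffled[n_held_out:]
    return train, test


def sample_scenarios(template, slot_values, n, seed=0):
    """Fill the template n times with randomly chosen slot values."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n):
        choice = {slot: rng.choice(values) for slot, values in slot_values.items()}
        scenarios.append(template.format(**choice))
    return scenarios


# Toy template with five slots; the wording is illustrative only.
SLOTS = ["a", "b", "c", "d", "e"]
TEMPLATE = (
    "Given {a} and {b}, with {c} under {d}, would you choose {e}? "
    "Answer Yes or No."
)

slot_values = build_slot_values(SLOTS)

# Number of distinct questions one template can generate: 15 ** 5.
n_combinations = 1
for values in slot_values.values():
    n_combinations *= len(values)

train_values, test_values = split_slot_values(slot_values, n_held_out=5)
train_scenarios = sample_scenarios(TEMPLATE, train_values, n=3, seed=1)
test_scenarios = sample_scenarios(TEMPLATE, test_values, n=3, seed=2)
```

With this split rule, no test scenario reuses a phrase seen in training, which is one simple way to obtain a distributional shift from a single template.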
