Explanations that reveal all through the definition of encoding

Aahlad Puli, Nhi Nguyen, Rajesh Ranganath

TL;DR

This paper formalizes encoding as a conditional dependence phenomenon in feature-attribution explanations, showing that some explanations predict labels beyond what their own values imply. It demonstrates that existing evaluators either fail to detect encoding or only weakly detect it, and it introduces STRIPE-X, a strong detector that combines EVAL-X with an encode-meter penalty to robustly rank non-encoding explanations above encoding ones. The authors validate STRIPE-X theoretically and empirically across simulations, image patches, and prompts for large language models, revealing encoding even in human- or machine-generated explanations. The work highlights a risk of misleading explanations in practice and provides a principled framework and tool for obtaining more interpretable, transparent explanations.

Abstract

Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, evaluations produce scores better than what appears possible from the values in the explanation for a class of explanations, called encoding explanations. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. This definition implies, in contrast to encoding explanations, that non-encoding explanations contain all the informative inputs used to produce the explanation, giving them a "what you see is what you get" property, which makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and develop STRIPE-X which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to show that despite prompting an LLM to produce non-encoding explanations for a sentiment analysis task, the LLM-generated explanations encode.

Paper Structure

This paper contains 58 sections, 16 theorems, 152 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Proposition 1

For the data-generating process in eq:sim-example-main, ROAR and FRESH assign their respective optimal scores to the encoding explanation ${e_\textrm{encode}(\boldsymbol{\mathbf{x}})}$.

Figures (7)

  • Figure 1: Overview of the paper. Explanations are produced to find inputs that are relevant to predicting a label. However, explanations can predict the label well due to the selection being predictive of the label beyond the explanation's values. Such explanations are called encoding. In contrast, predicting from a non-encoding explanation is equivalent to predicting from the values in the explanation. When explanations are evaluated purely based on the quality of prediction, encoding can go undetected. We classify existing evaluations into non-detectors and weak detectors and develop a strong detector, called STRIPE-X.
  • Figure 2: Intuition for encoding: There are two ways the information in the inputs $\boldsymbol{\mathbf{x}}$ about the label $\boldsymbol{\mathbf{y}}$ is transmitted to the explanation $\boldsymbol{\mathbf{x}}_{e(\boldsymbol{\mathbf{x}})}$: (1) through the values in the explanation and (2) the selection $e(\boldsymbol{\mathbf{x}})$ (in red). When the latter happens, the explanation is said to be encoding.
  • Figure 3: Left: Consider data where the color in the left half determines whether the label ("cat" or "dog") is produced from the top or bottom image on the right. Right: A MargEnc encoding explanation that produces only the top or the bottom animal image based on the color. The animal image alone says less about the label than knowing the animal image and the color. Knowing the selection determines the color and thus provides additional information about the label.
  • Figure 4: EVAL-X and STRIPE-X scores of the $3$ encoding constructions and the non-encoding constant explanation $(e(\boldsymbol{\mathbf{x}}) = \xi_1)$, for both . EVAL-X, being only a weak detector, assigns suboptimal scores to all encoding explanations ($<$), but scores some encoding explanations above the constant explanation. On the other hand, STRIPE-X, being a strong detector, pushes the scores of all the encoding explanations below that of the non-encoding constant explanation that always selects $\boldsymbol{\mathbf{x}}_1$.
  • Figure 5: Example and encoding. (a) The color determines whether the label is produced from the top or bottom image. (b) An explanation that correctly reveals that the label is generated based on both the color and, as dictated by the color, the top or the bottom image. The label is deterministic given the value of the explanation, which means the label can be predicted perfectly. (c) An encoding explanation would be one that produces only the top or the bottom animal image based on the color being red or blue, respectively. This returned animal image does not indicate that the data-generating process depends on color. The animal image selected by the explanation alone is insufficient to dictate the label because the color determines which image determines the label. The identity of the image, whether top or bottom, provides additional information about the label beyond the values in the explanation, as captured in Definition 1.
  • ...and 2 more figures
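
The color-and-animal example in Figures 3 and 5 can be made concrete with a small exact enumeration. The sketch below is illustrative only and is not the authors' code, their formal Definition 1, or the STRIPE-X estimator. It assumes a toy version of the data in which the color and the two animal images are independent fair binary variables (0 = red/cat, 1 = blue/dog), and every name in it (`e_encode`, `e_reveal`, `e_constant`, `encoding_gap`) is hypothetical. It compares the label distribution given the explanation's values alone with the label distribution given the values together with the selection; a nonzero gap means the selection carries information about the label beyond the values.

```python
from fractions import Fraction
from itertools import product


def inputs():
    """Enumerate all 8 equally likely inputs (an assumed toy distribution)."""
    for color, top, bottom in product((0, 1), repeat=3):
        yield {"color": color, "top": top, "bottom": bottom}


def label(x):
    """Red (0) means the label comes from the top image, blue (1) from the bottom."""
    return x["top"] if x["color"] == 0 else x["bottom"]


def e_encode(x):
    """Encoding explanation: reveals one animal image, chosen by the hidden color."""
    return ("top",) if x["color"] == 0 else ("bottom",)


def e_reveal(x):
    """Explanation that reveals the color together with the relevant image."""
    return ("color", "top") if x["color"] == 0 else ("color", "bottom")


def e_constant(x):
    """Constant explanation: always selects the top image."""
    return ("top",)


def prob_y1(rows):
    """Exact P(y = 1) over a non-empty list of equally likely inputs."""
    return Fraction(sum(label(x) for x in rows), len(rows))


def encoding_gap(explain):
    """Largest |P(y=1 | values, selection) - P(y=1 | values)| over reachable cases."""
    xs = list(inputs())
    gap = Fraction(0)
    for x in xs:
        v = explain(x)                      # selected features
        a = tuple(x[f] for f in v)          # their values
        same_values = [z for z in xs if tuple(z[f] for f in v) == a]
        same_values_and_selection = [z for z in same_values if explain(z) == v]
        diff = abs(prob_y1(same_values_and_selection) - prob_y1(same_values))
        gap = max(gap, diff)
    return gap


if __name__ == "__main__":
    for name, e in [("encode", e_encode), ("reveal", e_reveal), ("constant", e_constant)]:
        print(name, encoding_gap(e))
```

Under these assumptions the encoding explanation has a gap of 1/4: seeing a cat in the top image gives P(y = cat) = 3/4 from the value alone, but P(y = cat) = 1 once the selection of the top image is also known. The other two explanations have a gap of 0, matching the "what you see is what you get" reading of non-encoding explanations.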

Theorems & Definitions (31)

  • Definition 1: Encoding
  • Definition 2: Weak detection of encoding
  • Definition 3: Strong detection of encoding
  • Proposition 1
  • Theorem 1
  • Proposition 2
  • Proposition 3
  • Theorem 2
  • Lemma 1
  • proof
  • ...and 21 more