Explanations that reveal all through the definition of encoding
Aahlad Puli, Nhi Nguyen, Rajesh Ranganath
TL;DR
This paper formalizes encoding as a conditional dependence phenomenon in feature-attribution explanations, showing that some explanations predict labels beyond what their own values imply. It demonstrates that existing evaluators either fail to detect encoding or only weakly detect it, and it introduces STRIPE-X, a strong detector that combines EVAL-X with an encode-meter penalty to robustly rank non-encoding explanations above encoding ones. The authors validate STRIPE-X theoretically and empirically across simulations, image patches, and prompts for large language models, revealing encoding even in human- or machine-generated explanations. The work highlights a risk of misleading explanations in practice and provides a principled framework and tool for obtaining more interpretable, transparent explanations.
Abstract
Feature attributions attempt to highlight what inputs drive predictive power. Good attributions or explanations are thus those that produce inputs that retain this predictive power; accordingly, evaluations of explanations score their quality of prediction. However, for a class of explanations called encoding explanations, evaluations produce scores better than what appears possible from the values in the explanation. Probing for encoding remains a challenge because there is no general characterization of what gives the extra predictive power. We develop a definition of encoding that identifies this extra predictive power via conditional dependence and show that the definition fits existing examples of encoding. The definition implies that non-encoding explanations, in contrast to encoding ones, contain all the informative inputs used to produce the explanation, giving them a "what you see is what you get" property, which makes them transparent and simple to use. Next, we prove that existing scores (ROAR, FRESH, EVAL-X) do not rank non-encoding explanations above encoding ones, and develop STRIPE-X, which ranks them correctly. After empirically demonstrating the theoretical insights, we use STRIPE-X to show that despite prompting an LLM to produce non-encoding explanations for a sentiment analysis task, the LLM-generated explanations encode.
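To make the conditional-dependence idea concrete, here is a minimal toy sketch. It is not taken from the paper's experiments: the three-bit data distribution, both explainers, and the function `encoding_gap` are hypothetical names and assumptions chosen for illustration. The sketch reads "encoding" as: the label still depends on which inputs were selected, even after conditioning on the values of the selected inputs.

```python
from fractions import Fraction
from itertools import product

# Assumed toy setup: three independent fair bits; the label is the third bit.
def label(x):
    return x[2]

def encoding_explainer(x):
    # Selects a noise bit, but WHICH bit it selects leaks x[2] (the label).
    return (0,) if x[2] == 1 else (1,)

def plain_explainer(x):
    # Always selects the bit that actually determines the label.
    return (2,)

def encoding_gap(explainer):
    """Largest |P(y=1 | x_v, selected v) - P(y=1 | x_v, not selected v)|.

    Computed exactly by enumerating the 8 equally likely inputs.
    A gap of 0 means the selection adds nothing beyond the selected values.
    """
    inputs = list(product([0, 1], repeat=3))
    masks = {explainer(x) for x in inputs}
    worst = Fraction(0)
    for v in masks:
        groups = {}
        for x in inputs:
            values = tuple(x[i] for i in v)   # values visible under mask v
            chosen = explainer(x) == v        # did the explainer pick v?
            groups.setdefault((values, chosen), []).append(label(x))
        for values in {k for k, _ in groups}:
            sel = groups.get((values, True))
            not_sel = groups.get((values, False))
            if sel and not_sel:
                p_sel = Fraction(sum(sel), len(sel))
                p_not = Fraction(sum(not_sel), len(not_sel))
                worst = max(worst, abs(p_sel - p_not))
    return worst

print(encoding_gap(encoding_explainer))  # selection reveals the label
print(encoding_gap(plain_explainer))     # selection is constant, no leak
```

In this toy, the first explainer shows only noise values, yet the choice of position determines the label, so an evaluator that sees the selection can predict far better than the shown values suggest. The second explainer shows the informative input directly, which matches the "what you see is what you get" property described above.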
