llmSHAP: A Principled Approach to LLM Explainability

Filip Naudot, Tobias Sundqvist, Timotheus Kampik

TL;DR

llmSHAP systematically analyzes how stochastic LLM decoding affects Shapley-value explainability and introduces deterministic variants (e.g., cache-based Shapley) that restore axiomatic guarantees while offering speedups. By formalizing setup, axiomatic implications, and complexity, the work maps clear trade-offs between fidelity to exact Shapley attributions, inference speed, and principle attainment for LLM-based decision support. Empirical results on a disease-symptom task show that cache-based attribution remains stable across feature counts, while sliding-window and counterfactual approaches trade off speed and axioms. The study provides actionable guidance and open-source tooling for practitioners designing explainable LLM systems, and points to future directions like applying attribution to internal chain-of-thought steps.

Abstract

Feature attribution methods help make machine learning-based inference explainable by determining how much one or several features have contributed to a model's output. A particularly popular attribution method is based on the Shapley value from cooperative game theory, a measure that guarantees the satisfaction of several desirable principles, assuming deterministic inference. We apply the Shapley value to feature attribution in large language model (LLM)-based decision support systems, where inference is, by design, stochastic (non-deterministic). We then demonstrate when we can and cannot guarantee Shapley value principle satisfaction across different implementation variants applied to LLM-based decision support, and analyze how the stochastic nature of LLMs affects these guarantees. We also highlight trade-offs between explainable inference speed, agreement with exact Shapley value attributions, and principle attainment.

Paper Structure

This paper contains 10 sections, 8 theorems, 9 equations, 4 figures, 1 table, 3 algorithms.

Key Result

proposition 1

The cache-based Shapley value $\shapCS$ satisfies the Shapley axioms of efficiency, symmetry, and null player (dummy).
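For reference, the three axioms named above are usually stated as follows for a value function $v$ over the feature set $X$. This is the standard cooperative-game formulation, given here as an assumption about what the proposition refers to; the paper's own notation and normalization (for example, whether $v(\emptyset) = 0$ is assumed) may differ.

```latex
% Efficiency: attributions sum to the total gain of the full feature set.
\sum_{x_i \in X} \phi_i(v) = v(X) - v(\emptyset)

% Symmetry: interchangeable features receive equal attribution.
\bigl( v(S \cup \{x_i\}) = v(S \cup \{x_j\}) \;\; \forall S \subseteq X \setminus \{x_i, x_j\} \bigr)
  \;\Rightarrow\; \phi_i(v) = \phi_j(v)

% Null player (dummy): a feature that never changes the value receives zero.
\bigl( v(S \cup \{x_i\}) = v(S) \;\; \forall S \subseteq X \setminus \{x_i\} \bigr)
  \;\Rightarrow\; \phi_i(v) = 0
```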

Figures (4)

  • Figure 1: Illustration of Shapley value computation for each feature $x_i \in X$, with the feature of interest in orange, coalition features ($S$) in blue, and excluded features in gray. For each coalition $S \subseteq X \setminus \{x_i\}$ (blue), subtract $v(S)$ from $v(S \cup \{x_i\})$. Weight each marginal contribution by $\frac{|S|!(|X|-|S|-1)!}{|X|!}$, and sum to obtain the Shapley value $\phi_i(v)$.
  • Figure 2: Illustration of how two inferences drawn by the inference function $h$ using the same feature coalition $S$ may yield different results, as the underlying sampling process is governed by a probability distribution.
  • Figure 3: Cosine similarity between the attribution vector of the standard Shapley value $\shapS$ (gold standard) and those of the counterfactual $\shapC$, the sliding window with window size 3 $\shapSW_{w=3}$, and the cache-based Shapley value $\shapCS$.
  • Figure 4: Average runtime (seconds) for 4--10 features, based on two runs per feature count.
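The computation described in Figure 1 and the stochasticity issue in Figure 2 can be illustrated together with a minimal Python sketch. It is not the authors' implementation: `noisy_value` is a hypothetical stand-in for LLM inference (an additive score plus sampling noise), the feature names and weights are invented for illustration, and `cached` reflects one plausible reading of "cache-based", namely that each coalition is evaluated once and the result is reused.

```python
import random
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values: weighted marginal contributions over all coalitions."""
    n = len(features)
    phi = {}
    for x in features:
        others = [f for f in features if f != x]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|X| - |S| - 1)! / |X|! from Figure 1.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {x}) - value(s))
        phi[x] = total
    return phi


def cached(value):
    """Wrap a value function so each coalition is evaluated at most once."""
    store = {}

    def wrapper(coalition):
        key = frozenset(coalition)
        if key not in store:
            store[key] = value(key)
        return store[key]

    return wrapper


# Hypothetical features and weights, for illustration only.
WEIGHTS = {"fever": 2.0, "cough": 1.0, "rash": 0.5}
rng = random.Random(0)


def noisy_value(coalition):
    """Stand-in for stochastic inference: repeated calls on the same coalition differ."""
    return sum(WEIGHTS[f] for f in coalition) + rng.gauss(0.0, 0.3)


features = list(WEIGHTS)
v = cached(noisy_value)
phi = shapley_values(features, v)

# With a cached (hence deterministic) v, efficiency holds up to rounding error.
gap = sum(phi.values()) - (v(frozenset(features)) - v(frozenset()))
print(phi)
print(abs(gap) < 1e-9)
```

Calling `shapley_values(features, noisy_value)` without the cache re-samples each coalition every time it appears, so the same efficiency check generally fails; the wrapper fixes one sample per coalition, which is what makes the value function deterministic in this sketch.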

Theorems & Definitions (16)

  • proposition 1
  • proof
  • proposition 2
  • proof
  • proposition 3
  • proof
  • proposition 4
  • proof
  • proposition 5
  • proof
  • ...and 6 more