ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Abhinaba Basu, Pavan Chakraborty

Abstract

Evaluating whether explanations faithfully reflect a model's reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points. Deletion typically inflates estimates on short text, but the pattern reverses on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows near-zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and the ICEBench benchmark.
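To make the idea of "win rates with confidence intervals" against matched random baselines concrete, here is a minimal Python sketch. It is not the authors' implementation: the scoring convention (higher score means the model's prediction is better preserved), the tie handling, the add-one p-value, the percentile bootstrap, and all function names and toy numbers are assumptions made for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def example_win_rate(rationale_score: float, random_scores: Sequence[float]) -> float:
    """Share of matched random baselines the rationale beats (ties count half)."""
    wins = sum(1.0 for s in random_scores if rationale_score > s)
    ties = sum(0.5 for s in random_scores if rationale_score == s)
    return (wins + ties) / len(random_scores)


def randomization_p_value(rationale_score: float, random_scores: Sequence[float]) -> float:
    """One-sided p-value: how often a random subset scores at least as well.

    The +1 terms keep the estimate away from exactly zero (an assumed convention).
    """
    at_least = sum(1 for s in random_scores if s >= rationale_score)
    return (1 + at_least) / (1 + len(random_scores))


def bootstrap_ci(values: List[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-example win rates."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical toy data: (rationale score, scores of same-size random token subsets).
toy = [
    (0.9, [0.2, 0.4, 0.5, 0.3]),
    (0.6, [0.6, 0.7, 0.1, 0.2]),
    (0.1, [0.3, 0.5, 0.2, 0.4]),
]

per_example = [example_win_rate(r, b) for r, b in toy]
overall = sum(per_example) / len(per_example)
p_values = [randomization_p_value(r, b) for r, b in toy]
ci_low, ci_high = bootstrap_ci(per_example)
print(f"win rate = {overall:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Under this convention, a win rate near 50% means the explanation is indistinguishable from random token subsets of the same size, and a win rate well below 50% corresponds to the anti-faithful regime mentioned in the abstract.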

Paper Structure

This paper contains 54 sections, 3 equations, 10 figures, 17 tables, 1 algorithm.

Figures (10)

  • Figure 1: Evolution of faithfulness evaluation frameworks. ICE builds on evaluation methodology (blue) and scope expansion (green), adding statistical rigor (orange) via randomization testing and operator-consistent evaluation. Arrows show methodological lineage.
  • Figure 2: ICE pipeline on a sentiment example. Attention identifies "gorgeous" and "seductive" as the rationale. Delete removes all other tokens, leaving only the rationale. This unnatural input preserves the prediction (WR = 92%), but that may reflect OOD artifacts rather than genuine faithfulness. Retrieval Infill replaces non-rationale tokens with tokens randomly sampled from other corpus examples ("slow, familiar"), preserving natural surface form. The rationale must now dominate in realistic context rather than in isolation (WR = 44%). Label tokens are blacklisted, but replacement text may carry incidental sentiment; this is by design, as it tests robustness of the attribution signal. Same rationale, same model, same metric: only the operator differs, yet the verdict changes.
  • Figure 3: English faithfulness (win rate %) under both operators and both attribution methods. Top: Attention. Bottom: Gradient. Left: Deletion. Right: Retrieval Infill. Models (top to bottom): GPT-2, LFM2, Llama-3.2, Llama-3.1, Qwen, Mistral, DeepSeek. Right panels share the same model order. On short text (SST-2), deletion yields higher estimates; on long text (IMDB), the pattern reverses for most models. Green = faithful (> 60%), yellow = random, red = anti-faithful (< 40%).
  • Figure 4: Multilingual faithfulness (attention win rate %) under both operators across 6 languages and 4 scripts. Left: Deletion. Right: Retrieval Infill. GPT-2 Hindi increases under retrieval (65.4% → 68.8%), while Llama-3.1 drops to anti-faithful (35.1% Hindi, 37.3% Chinese). Gray = no valid output (DeepSeek Arabic/Hindi/Chinese under retrieval due to tokenizer limitations).
  • Figure 5: IoU (human alignment, x-axis) vs. ICE Win Rate (faithfulness, y-axis) across GPT-2, DeepSeek, and Mistral on e-SNLI. All $|r| < 0.04$: no correlation between human alignment and computational faithfulness.
  • ...and 5 more figures
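
The two intervention operators described in the Figure 2 caption can be sketched as follows. This is an illustrative Python sketch based only on that caption, not the released ICE code: the function names, the token-level interface, the toy sentence, the replacement pool, and the blacklist contents are all hypothetical.

```python
import random
from typing import List, Sequence, Set


def delete_operator(tokens: Sequence[str], rationale_idx: Set[int]) -> List[str]:
    """Keep only the rationale tokens and drop everything else."""
    return [t for i, t in enumerate(tokens) if i in rationale_idx]


def retrieval_infill_operator(tokens: Sequence[str], rationale_idx: Set[int],
                              corpus_pool: Sequence[str], blacklist: Set[str],
                              rng: random.Random) -> List[str]:
    """Keep rationale tokens in place; replace every other token with one
    sampled from other corpus examples, skipping blacklisted label tokens."""
    candidates = [t for t in corpus_pool if t.lower() not in blacklist]
    if not candidates:
        raise ValueError("no replacement tokens left after blacklisting")
    return [t if i in rationale_idx else rng.choice(candidates)
            for i, t in enumerate(tokens)]


# Hypothetical toy input; only "gorgeous", "seductive", "slow", and "familiar"
# come from the caption, the rest is made up for the example.
tokens = ["a", "gorgeous", "and", "seductive", "film"]
rationale = {1, 3}
pool = ["slow", "familiar", "plot", "positive", "negative"]
label_blacklist = {"positive", "negative"}

kept = delete_operator(tokens, rationale)
infilled = retrieval_infill_operator(tokens, rationale, pool,
                                     label_blacklist, random.Random(0))
print(kept)
print(infilled)
```

The deleted input contains only the rationale, while the infilled input has the same length as the original with the rationale tokens left in their positions. Scoring both variants with the same model and metric is what exposes the operator gap the caption describes.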