
Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models

Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein

TL;DR

This work introduces the first gold standard highlight explanation (HE) evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations.

Abstract

Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point to the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework's broad applicability, we evaluate four HE methods -- three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task -- across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.

Paper Structure

This paper contains 24 sections, 25 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Utility evaluation of two HEs in our framework under Conflicting and Double-Conflicting context setups. In the left-hand example, the model selects the answer from the given passage. Explainer 2 shows better utility than Explainer 1. In the right-hand example, the model selects the answer from passage 2. Explainer 1 shows better utility than Explainer 2.
  • Figure 2: $\Delta Rank@k^{\mathrm{grp}}$ (Eq. \ref{eq:drankk-cross-instance-group}) -- average margins for the explanation importance rank of context tokens in context vs. memory answer instances in Conflicting and Irrelevant setups (§\ref{sec:four-context-setups}). Higher $\Delta Rank@k$ is better.
  • Figure 3: MDL-Bits@k (left $y$-axis; Eq. \ref{eq:mdl}) and NMutInf@k (right $y$-axis; Eq. \ref{eq:mutual-information}) for explanation simulatability in Conflicting and Irrelevant setups (§\ref{sec:four-context-setups}). Lower MDL-Bits@k and higher NMutInf@k are better.
  • Figure 4: $\Delta Rank@k^{\mathrm{grp}}$ (Eq. \ref{eq:drankk-cross-instance-group}) -- average margins for the rank of context $c_1$ and $c_2$ between two instance groups $D_{c_1}$ and $D_{c_2}$ in the Double-Conflicting and Mixed setups (§\ref{input-regime-2}). Higher $\Delta Rank@k^{\mathrm{grp}}$ is better.
  • Figure 5: $\Delta Rank@k^{\mathrm{inst}}$ (Eq. \ref{eq:drankk-within-instance}) -- average within-instance-group margins between the rank of the answer context piece and the other context piece in the Double-Conflicting and Mixed setups (§\ref{input-regime-2}). Higher $\Delta Rank@k^{\mathrm{inst}}$ is better.
  • ...and 9 more figures