SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence

Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, Hanghang Tong

TL;DR

SelfElicit addresses LM factuality in context-based QA by performing inference-time evidence highlighting that leverages deeper-layer attention to identify key contextual sentences. It uses a lightweight, training-free mechanism to score sentences via a selected set of evidence-reading layers and then highlights the chosen sentences within the input context to guide the LM toward relevant information. Across six LM families and four QA datasets, SelfElicit yields consistent 5.0%–11.7% improvements in grounded factuality with significantly lower overhead than iterative prompting baselines. Deeper layers prove especially informative for evidence elicitation, and the approach remains robust to context noise and threshold choices, offering a practical, scalable enhancement for real-world RAG-style QA tasks.

Abstract

Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
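The highlighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the attention matrix is a hand-made toy, the choice of "evidence-reading" layers is supplied by the caller, the selection rule (keep sentences whose score is at least alpha times the best score) is an assumption, and the marker strings and all function names are hypothetical.

```python
import numpy as np

def score_sentences(attn, spans, layers):
    """Average attention per sentence over the chosen layers.

    attn:   array of shape (num_layers, num_context_tokens); assumed to hold
            the attention each layer pays to every context token.
    spans:  list of (start, end) token ranges per sentence, end exclusive.
    layers: indices of the layers treated as evidence-reading (an assumption).
    """
    per_token = attn[layers].mean(axis=0)
    return np.array([per_token[s:e].mean() for s, e in spans])

def select_evidence(scores, alpha):
    """Assumed rule: keep sentences scoring at least alpha * max score."""
    cutoff = alpha * scores.max()
    return [i for i, s in enumerate(scores) if s >= cutoff]

def highlight(sentences, selected, open_tag="<start_important>",
              close_tag="<end_important>"):
    """Wrap the selected sentences in (hypothetical) marker strings."""
    chosen = set(selected)
    return " ".join(
        f"{open_tag}{s}{close_tag}" if i in chosen else s
        for i, s in enumerate(sentences)
    )

# Toy data: three sentences, eight context tokens, three layers.
sentences = ["The sky is blue.", "The library opened in 1990.", "Cats are mammals."]
spans = [(0, 3), (3, 6), (6, 8)]
attn = np.array([
    [0.20, 0.20, 0.20, 0.10, 0.10, 0.10, 0.05, 0.05],      # shallow layer
    [0.05, 0.05, 0.05, 0.30, 0.30, 0.20, 0.025, 0.025],    # deeper layer
    [0.04, 0.04, 0.04, 0.30, 0.30, 0.20, 0.04, 0.04],      # deeper layer
])

scores = score_sentences(attn, spans, layers=[1, 2])
selected = select_evidence(scores, alpha=0.5)
highlighted = highlight(sentences, selected)
print(highlighted)
```

In this toy case only the second sentence clears the cutoff, so it alone is wrapped in markers before the context would be passed back to the model.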

Paper Structure

This paper contains 40 sections, 4 equations, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: SelfElicit workflow on a real example with Llama3.1-8B. By locating and explicitly highlighting the initially overlooked 2nd-hop evidence (“SAS was founded in 1941 ...”) within the context, SelfElicit guides the model to arrive at the correct answer “1941”.
  • Figure 2: Relative attention to the evidence/non-evidence sections (y-axis) across the layers (x-axis) for different LM families on HotpotQA. Deeper layers pay much greater attention to crucial evidence (green lines) in the context, even when LM responds incorrectly (dashed lines). Best viewed in color.
  • Figure 3: SelfElicit retains a robust advantage even in the presence of noisy context (QA performance panel). When the context passages contain more distracting information, SelfElicit tends to select a significantly smaller portion of text as evidence (elicit ratio panel) to prevent the LM from being distracted by irrelevant contexts.
  • Figure 4: Impact of elicit threshold $\alpha$ (x-axis) on the QA performance gain (blue bars, left y-axis) and evidence elicit ratio (orange lines, right y-axis) of SelfElicit on four QA tasks. Best viewed in color.
  • Figure 5: Across different LM families and datasets, the deep attention layers highlight the crucial evidence sentences within the context, even when the LM gives incorrect answers (dashed lines). The x-axis is the depth of attention layers, and the y-axis is the ratio of the average attention per token (APT) in the evidence/non-evidence sections to the APT across the entire context. For the HotpotQA dataset (Yang et al., 2018), we use the "supporting_facts" annotations to differentiate evidence and non-evidence sentences within the context. For other datasets, we treat a context sentence as evidence if it contains at least one of the correct answers.
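
The y-axis quantity used in Figures 2 and 5 (APT in a section divided by APT over the whole context) can be computed as in the sketch below. The array layout, the boolean evidence mask, and the function name `relative_apt` are assumptions made for illustration; the numbers are toy values, not results from the paper.

```python
import numpy as np

def relative_apt(attn, evidence_mask):
    """Per-layer ratio of section APT to whole-context APT.

    attn:          array of shape (num_layers, num_context_tokens).
    evidence_mask: boolean per-token flags, True for evidence tokens.
    Returns (evidence_ratio, non_evidence_ratio), one value per layer.
    """
    mask = np.asarray(evidence_mask, dtype=bool)
    apt_all = attn.mean(axis=1)            # APT over the entire context
    apt_ev = attn[:, mask].mean(axis=1)    # APT over evidence tokens
    apt_non = attn[:, ~mask].mean(axis=1)  # APT over non-evidence tokens
    return apt_ev / apt_all, apt_non / apt_all

# Toy example: two layers, four context tokens, token 1 is evidence.
toy_attn = np.array([
    [0.25, 0.25, 0.25, 0.25],  # uniform attention
    [0.10, 0.60, 0.20, 0.10],  # attention concentrated on the evidence token
])
toy_mask = [False, True, False, False]

ev_ratio, non_ratio = relative_apt(toy_attn, toy_mask)
print(ev_ratio, non_ratio)
```

A ratio above 1 means the section receives more attention per token than the context average; in the toy data the second layer gives the evidence token 2.4 times the average.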