Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning

Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li

TL;DR

The paper identifies the Echo of Prompt (EOP) as a spontaneous front-loaded repetition in large reasoning models and formalizes its cost via a rejection-sampling framework, introducing the Echo Likelihood Gap $\Delta\mathcal{L}$ as a proxy for the echo’s trade-off with downstream accuracy. It demonstrates that EOP acts as an attention refocusing mechanism, with increased within-trace attention to the answer-prefix in mid layers correlating with correctness. To harness this phenomenon, the authors propose Echo-Distilled SFT (ED-SFT), which trains models to adopt an echo-then-reason pattern, and Echoic Prompting (EP), a training-free method that re-grounds the model on the original prompt during inference. Across GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500, both ED-SFT and EP yield consistent performance gains and generalize across architectures, supporting the view of EOP as a beneficial cognitive primitive rather than a mere flaw. The work offers mechanistic explanations, including mid-layer attention dynamics and information-flow pathways, and provides practical guidance for cultivating robust self-aligned reasoning in LRMs.

Abstract

Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding generic ``thinking tokens'', and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain -- and often ignore -- the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical link from early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length- and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer-to-answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes-as-anchors.
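
To make the idea of a suffix-controlled, per-token likelihood comparison concrete, here is a minimal Python sketch. It is an illustrative assumption, not the paper's formal definition of $\Delta\mathcal{L}$, which this summary does not reproduce. The function name `per_token_gap` and its two inputs are hypothetical: each input is a list of log-probabilities for the same suffix tokens, scored once with the echo kept in the context and once with it removed.

```python
def per_token_gap(logp_with_echo, logp_without_echo):
    """Mean per-token log-likelihood difference over an identical suffix.

    Hypothetical helper: both lists hold log-probabilities of the SAME
    suffix tokens, scored with and without the echo in the context.
    A positive value means the suffix is more likely when the echo is kept.
    """
    if len(logp_with_echo) != len(logp_without_echo):
        raise ValueError("both scorings must cover the same suffix tokens")
    if not logp_with_echo:
        raise ValueError("suffix must contain at least one token")
    n = len(logp_with_echo)
    # Dividing by the suffix length makes the comparison length-controlled.
    return (sum(logp_with_echo) - sum(logp_without_echo)) / n


# Toy numbers for illustration only (not results from the paper).
gap = per_token_gap([-1.0, -2.0], [-2.0, -3.0])
print(gap)
```

Normalizing by the suffix length is one simple way to keep a longer trace from looking better merely because it contains more tokens.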

Paper Structure (55 sections, 11 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: An illustration of the Echo of Prompt (EOP). Left: An example of a model's thinking process starting with an echo of the user's query. Right: The frequency of EOP across several open-source models on the GSM8K dataset, as measured by our trained MLP probe (see §\ref{app:mlp_probe}).
  • Figure 2: Echo metrics on GSM8K. Left: High-resolution histogram of removed echo-prefix lengths (10-token bins) for correct and wrong traces; most mass lies between roughly 200 and 240 tokens. Right: Echo Likelihood Gap $\Delta\mathcal{L}$ (per-token) stratified by removed-prefix length bin; the gap remains positive across all bins.
  • Figure 3: Layer-wise attention weight distribution on GSM8K (DeepSeek-R1-Distill-Llama-8B). Left: answer$\to$answer-prefix. Right: answer$\to$question. Blue lines represent correct reasoning traces and orange lines represent incorrect ones. The attention refocusing effect is most pronounced in layers 7-18 for answer$\to$answer-prefix, with correct traces maintaining consistently higher attention weights.
  • Figure 4: Echoic Prompting (EP) vs. TTTS on AIME24 (left) and MATH-500 (right).
  • Figure 5: Comparison of raw attention difference (Correct $-$ Wrong) and normalized effect size (Cohen's $d$) across layers for the answer$\to$answer-prefix metric. The strong alignment between the raw and normalized metrics confirms that the mid-layer refocusing peak is a robust phenomenon.
  • ...and 4 more figures
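
The Figure 5 comparison between a raw attention difference (Correct $-$ Wrong) and a normalized effect size can be sketched as follows. This is a minimal illustration that assumes the standard pooled-standard-deviation form of Cohen's $d$; the paper's exact estimator is not stated in this summary. The helper names `cohens_d` and `layerwise_effect` and the toy inputs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(correct, wrong):
    """Cohen's d with a pooled standard deviation (assumed estimator)."""
    n1, n2 = len(correct), len(wrong)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n1 - 1) * variance(correct)
                  + (n2 - 1) * variance(wrong)) / (n1 + n2 - 2)
    return (mean(correct) - mean(wrong)) / sqrt(pooled_var)


def layerwise_effect(correct_by_layer, wrong_by_layer):
    """Return (layer, raw difference, Cohen's d) for every layer.

    Each argument is a list indexed by layer; every entry is a list of
    per-trace attention weights (for example answer -> answer-prefix).
    """
    rows = []
    for layer, (c, w) in enumerate(zip(correct_by_layer, wrong_by_layer)):
        rows.append((layer, mean(c) - mean(w), cohens_d(c, w)))
    return rows


# Toy values for two layers, for illustration only (not data from the paper).
correct = [[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
wrong = [[1.0, 3.0, 5.0], [1.0, 2.0, 3.0]]
for layer, raw, d in layerwise_effect(correct, wrong):
    print(layer, raw, d)
```

Reporting both columns side by side shows whether a large raw gap at a given layer merely reflects high variance there, which is the kind of check the Figure 5 caption describes.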