When Does Context Help? Error Dynamics of Contextual Information in Large Language Models

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

TL;DR

The paper introduces a unified error-dynamics framework that explains how arbitrary contextual information affects inference in Transformer-based LLMs. For a single-layer Transformer, it proves that the context-conditioned error vector equals the sum of the baseline error vector and a contextual correction vector. This decomposition yields necessary conditions on the direction and norm of the correction for the error to shrink, together with an upper bound on the correction norm determined by context-query relevance and complementarity. The results extend to multi-context and multi-layer Transformers, preserving the core conditions, and are validated on in-context learning (ICL), retrieval-augmented generation (RAG), and memory evolution (ME) with multiple models and datasets. Empirically, the primary failure modes are misalignment between the contextual correction and the negative baseline error, and insufficient correction magnitude. These findings motivate a principled context-selection strategy, built on a vector-direction predictor and a norm-based ranking scheme, that improves performance by about 0.6%.

Abstract

Contextual information at inference time, such as demonstrations, retrieved knowledge, or interaction history, can substantially improve large language models (LLMs) without parameter updates, yet its theoretical role remains poorly understood beyond specific settings such as in-context learning (ICL). We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in Transformer-based LLMs. Our analysis characterizes contextual influence through output error dynamics. In a single-layer Transformer, we prove that the context-conditioned error vector decomposes additively into the baseline error vector and a contextual correction vector. This yields necessary geometric conditions for error reduction: the contextual correction must align with the negative baseline error and satisfy a norm constraint. We further show that the contextual correction norm admits an explicit upper bound determined by context-query relevance and complementarity. These results extend to multi-context and multi-layer Transformers. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy that improves performance by $0.6\%$.

Paper Structure

This paper contains 58 sections, 6 theorems, 72 equations, 24 figures, 8 tables.

Key Result

Theorem 2

$e(t,x) = e(x) + g(t,x)$
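
The norm and direction conditions follow from this decomposition in a few steps. The derivation below is a sketch, assuming Euclidean norms and taking $\rho(t,x)$ to be the cosine of the angle between $g(t,x)$ and $-e(x)$, as the Figure 2 caption describes it; the paper's own statement may differ in details.

$\|e(t,x)\|^2 = \|e(x)\|^2 + 2\langle e(x), g(t,x)\rangle + \|g(t,x)\|^2 = \|e(x)\|^2 - 2\rho(t,x)\,\|e(x)\|\,\|g(t,x)\| + \|g(t,x)\|^2$

For $g(t,x) \neq 0$, the error therefore shrinks exactly when

$\|e(t,x)\| < \|e(x)\| \iff \|g(t,x)\| < 2\rho(t,x)\,\|e(x)\| \iff \frac{\|g(t,x)\|}{2\|e(x)\|} < \rho(t,x)$

This requires $\rho(t,x) > 0$ (the correction points toward the negative baseline error) and a correction norm below $2\rho(t,x)\|e(x)\|$, which is the gray region described in Figure 2.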

Figures (24)

  • Figure 1: Illustration of how contextual information affects the output error. The red vector denotes the baseline error vector $e(x)$, the blue vector denotes the contextual correction vector $g(t,x)$, and the green vector denotes the context-conditioned error vector $e(t,x)$. (a) The norm and direction of the contextual correction vector are both appropriate, so the resulting error norm is smaller than the original error norm; (b) the norm of the contextual correction vector is appropriate, but its direction is not aligned with the negative baseline error, which leads to a larger error; (c) the direction of the contextual correction vector is appropriate, but its norm is much larger than that of the original error, which also leads to a larger error. A case study is presented in the appendix.
  • Figure 2: The variation of error change $\|e(t,x)\| - \|e(x)\|$ with respect to the contextual correction vector $g(t, x)$, where each point represents one data instance. The length of the vector from the axis origin to each point represents $\frac{\|g(t,x)\|}{2\|e(x)\|}$, and the angle with the positive $x$-axis represents $\arccos{\rho(t,x)}$ (i.e., the angle between the contextual correction vector and the negative baseline error vector). Blue and red indicate that the error change is smaller or larger than $0$, respectively, and darker colors indicate larger absolute values. The gray region corresponds to $\frac{\|g(t,x)\|}{2\|e(x)\|} < \rho(t,x)$.
  • Figure 3: The variation of error change norm $\|e(t,x)-e(x)\|$ with respect to the contextual correction vector norm $\|g(t,x)\|$, where each point represents one data instance. The curves at the top and to the right show the distributions of the data points along the $x$-axis and $y$-axis, respectively. The Pearson correlation coefficient for the fitted points is $0.903$.
  • Figure 4: The variation of contextual correction vector norm $\|g(t,x)\|$ with respect to the relevance and complementarity $\alpha_{x \leftarrow t}\|v_t-v_x\|$, where each point represents one data instance. The curves at the top and to the right show the distributions of the data points along the $x$-axis and $y$-axis, respectively. The Pearson correlation coefficient for the fitted points is $0.770$.
  • Figure 5: The variation of error change norm with respect to the contextual correction vector norm $\|g(t,x)\|$ under the multi-context and multi-layer settings, where $\Delta e$ denotes $e(T, x | t_{n+1}) - e(T, x)$ (multi-context) or $e^{(L)}(t,x) - e^{(L)}(x)$ (multi-layer), and each point represents one data instance. The curves at the top and to the right show the distributions of the data points along the $x$-axis and $y$-axis, respectively.
  • ...and 19 more figures
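
The three cases in Figure 1 and the gray-region condition in Figure 2 can be reproduced numerically. The following is a minimal sketch, not the authors' code: the 2-D vectors are made-up toy values, the helper names (`norm`, `alignment`, `reduces_error`) are hypothetical, and $\rho(t,x)$ is assumed to be the cosine between $g(t,x)$ and $-e(x)$ as in the Figure 2 caption.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def add(u, v):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]


def alignment(e, g):
    """rho: cosine of the angle between g and the negative baseline error -e."""
    dot = sum(a * b for a, b in zip(e, g))
    return -dot / (norm(e) * norm(g))


def reduces_error(e, g):
    """Condition from the Figure 2 caption: ||g|| / (2 ||e||) < rho."""
    return norm(g) / (2 * norm(e)) < alignment(e, g)


# Toy baseline error e(x); purely illustrative values.
e = [1.0, 0.0]

# Toy contextual correction vectors g(t, x), one per case in Figure 1.
cases = {
    "a_aligned_moderate": [-0.5, 0.0],   # good direction, moderate norm
    "b_misaligned": [0.3, 0.4],          # moderate norm, wrong direction
    "c_aligned_too_large": [-3.0, 0.0],  # good direction, norm too large
}

results = {}
for name, g in cases.items():
    # Error change ||e(t,x)|| - ||e(x)||, using e(t,x) = e(x) + g(t,x).
    delta = norm(add(e, g)) - norm(e)
    results[name] = (delta, reduces_error(e, g))
    print(f"{name}: error change = {delta:+.3f}, condition holds = {results[name][1]}")
```

In this toy setup only case (a) satisfies the condition and lowers the error norm; cases (b) and (c) both increase it, matching the qualitative picture in Figure 1.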

Theorems & Definitions (13)

  • Definition 1: Contextual Correction Vector
  • Theorem 2
  • Theorem 3
  • Corollary 4
  • Corollary 5
  • Lemma 6
  • Proof of Lemma 6 (lem:attn-diff)
  • Lemma 7
  • Proof of Lemma 7 (lem:mlp-lipschitz)
  • Proof of Theorem 2 (thm:error-decomp)
  • ...and 3 more