Decocted Experience Improves Test-Time Inference in LLM Agents

Maohao Shen, Kaiwen Zha, Zexue He, Zhang-Wei Hong, Siru Ouyang, J. Jon Ryu, Prasanna Sattigeri, Suhas Diggavi, Gregory Wornell

Abstract

There is growing interest in improving LLMs without updating model parameters. One well-established direction is test-time scaling, where increased inference-time computation (e.g., longer reasoning, sampling, or search) is used to improve performance. However, for complex reasoning and agentic tasks, naively scaling test-time compute can substantially increase cost and still lead to wasted budget on suboptimal exploration. In this paper, we explore \emph{context} as a complementary scaling axis for improving LLM performance, and systematically study how to construct better inputs that guide reasoning through \emph{experience}. We show that effective context construction critically depends on \emph{decocted experience}. We present a detailed analysis of experience-augmented agents, studying how to derive context from experience, how performance scales with accumulated experience, what characterizes good context, and which data structures best support context construction. We identify \emph{decocted experience} as a key mechanism for effective context construction: extracting essence from experience, organizing it coherently, and retrieving salient information to build effective context. We validate our findings across reasoning and agentic tasks, including math reasoning, web browsing, and software engineering.
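The pipeline described in the abstract — distill lessons from past trajectories, organize them in an experience memory, then retrieve salient ones to build context at test time — can be sketched as follows. All names here (`Lesson`, `ExperienceMemory`, `build_context`, the tag-overlap scoring) are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    task_tags: set   # keywords describing when the lesson applies
    text: str        # distilled, reusable advice extracted from a trajectory

@dataclass
class ExperienceMemory:
    lessons: list = field(default_factory=list)

    def add(self, lesson):
        self.lessons.append(lesson)

    def retrieve(self, query_tags, k=3):
        # Score lessons by tag overlap with the new task (a toy stand-in
        # for embedding similarity) and return the top-k.
        scored = sorted(self.lessons,
                        key=lambda l: len(l.task_tags & query_tags),
                        reverse=True)
        return scored[:k]

def build_context(memory, query_tags, k=3):
    # Concatenate the k most salient lessons into a context prefix
    # that is prepended to the agent's prompt.
    hits = memory.retrieve(query_tags, k)
    return "\n".join(f"- {l.text}" for l in hits)

memory = ExperienceMemory()
memory.add(Lesson({"math", "algebra"}, "Check edge cases x=0 before dividing."))
memory.add(Lesson({"web", "search"}, "Prefer site filters over free-text queries."))
print(build_context(memory, {"math"}, k=1))
```

The key design point the paper studies is the middle step: storing distilled lessons rather than raw trajectories keeps retrieved context short and salient.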

Paper Structure

This paper contains 34 sections, 1 theorem, 7 equations, 11 figures, and 1 algorithm.

Key Result

Proposition 4.1

Fix $X=x$ and $C=c$. Suppose there exists a constant $h>0$ such that $H(Y_t \mid Y_{<t}, X=x, C=c) \ge h$ for every token position $t$; in words, each token carries at least $h$ bits of uncertainty on average. Then
$$\mathbb{E}\big[\,|Y|\,\big] \;\le\; \frac{H(Y \mid X=x, C=c)}{h}.$$
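A short derivation of the bound under the stated per-token entropy assumption, via the entropy chain rule (consistent with Figure 5, where $\hat{H}(Y \mid X=x, C=c)$ predicts expected output length):

```latex
\begin{align}
H(Y \mid X{=}x, C{=}c)
  &= \sum_{t \ge 1} H(Y_t \mid Y_{<t}, X{=}x, C{=}c)
     && \text{(chain rule)} \\
  &\ge h \,\mathbb{E}\big[\,|Y|\,\big]
     && \text{(each generated token carries at least } h \text{ bits),}
\end{align}
```

so $\mathbb{E}[\,|Y|\,] \le H(Y \mid X{=}x, C{=}c)/h$: context that lowers the conditional entropy of the answer also shortens the expected generation, which is the sense in which good context buys inference efficiency.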

Figures (11)

  • Figure 1: Experience-Augmented Agent. The agent accumulates experience from past interactions, decocts it into effective context for improved inference at test time, i.e., distilling lessons from experience, organizing the experience memory, and finally retrieving salient information from it.
  • Figure 2: Raw Experience vs. Distilled Lesson as Context. Both context construction approaches significantly outperform the vanilla agent without context. Raw experience is slightly stronger for mathematical reasoning, while distilled lessons yield better performance in agentic tasks (WebShop & SWE), where trajectory-level observations are noisier and distillation helps.
  • Figure 3: Scaling Behavior. The agent's performance as a function of input context length as $K$ increases in Top-$K$ retrieval. Distilled lessons achieve better performance with fewer input tokens and remain more robust as context grows. In contrast, raw experience can degrade when the prompt becomes excessively long and noisy. Overall, lesson distillation acts as an effective context compression mechanism that extracts the essential information from experience.
  • Figure 4: Experience Scaling Behavior via Memory Consolidation. We evaluate experience memory consolidation across varying memory sizes. The red dashed line indicates full memory baseline performance. Memory consolidation achieves a sweet spot at intermediate sizes.
  • Figure 5: Empirical validation of Proposition 4.1. Figure (a)'s strong linear correlation confirms that $\hat{H}(Y \mid X=x, C=c)$ tightly predicts the expected output length up to a constant. Figure (b) shows that relevant context (retrieved lessons) yields higher information gain than random context.
  • ...and 6 more figures
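The memory consolidation evaluated in Figure 4 — capping the experience memory at an intermediate size by merging redundant entries — can be sketched as greedy pairwise merging. The similarity and merge functions below are toy illustrations; the paper's actual consolidation procedure is not specified here, and in practice one would use embedding similarity and an LLM summarizer.

```python
def consolidate(lessons, max_size, similarity, merge):
    """Greedily merge the most similar pair of lessons until the memory
    fits within max_size entries. `similarity(a, b) -> float` and
    `merge(a, b) -> str` are supplied by the caller."""
    lessons = list(lessons)
    while len(lessons) > max_size:
        # Find the closest pair of lessons in the current memory.
        i, j = max(((i, j) for i in range(len(lessons))
                           for j in range(i + 1, len(lessons))),
                   key=lambda ij: similarity(lessons[ij[0]], lessons[ij[1]]))
        merged = merge(lessons[i], lessons[j])
        # Replace the pair with its merged summary.
        lessons = [l for k, l in enumerate(lessons) if k not in (i, j)]
        lessons.append(merged)
    return lessons

# Toy similarity: shared-word count; toy merge: keep the longer lesson.
sim = lambda a, b: len(set(a.split()) & set(b.split()))
mrg = lambda a, b: a if len(a) >= len(b) else b
out = consolidate(["check x=0 first", "check x=0 before dividing",
                   "use site filters"], 2, sim, mrg)
print(out)
```

This illustrates the trade-off Figure 4 measures: aggressive consolidation (small `max_size`) loses detail, while no consolidation leaves the memory noisy, with a sweet spot at intermediate sizes.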

Theorems & Definitions (2)

  • Proposition 4.1: Context Efficiency Bound
  • Proof of Proposition 4.1