
Improving Language Agents through BREW

Shashank Kirtania, Param Biyani, Priyanshu Gupta, Yasharth Bajpai, Roshni Iyer, Sumit Gulwani, Gustavo Soares

TL;DR

Large language model–based agents struggle with real-world long-horizon tasks due to costly policy optimization and opaque internal representations. BREW introduces a memory-centric framework that builds a modular, interpretable knowledge base from past interactions and uses Expand-and-Gather MCTS to optimize memory configuration for downstream tasks. Across OSWorld, $\tau^2$-Bench, and SpreadsheetBench, BREW yields $10$-$20\%$ gains in task precision and $10$-$15\%$ faster execution via reduced tool calls, while maintaining computational efficiency. This memory-first approach provides robustness, interpretability, and transferability across domains, suggesting a practical pathway toward persistent, adaptable agents in real-world environments.

Abstract

Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multistep planning, and computer-use automation. However, despite this versatility, training methods that optimize model weights, such as PPO and GRPO, remain relatively impractical because of the high computational overhead required for rollouts to converge. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining a structured memory of what an agent learns experientially from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for optimizing agents for downstream tasks via knowledge base (KB) construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights, and leverages state-space search to ensure robustness to the noise and non-specificity of natural language. Empirical results on real-world, domain-grounded benchmarks -- OSWorld, $\tau^2$-Bench, and SpreadsheetBench -- show that BREW achieves a $10$-$20\%$ improvement in task precision and a $10$-$15\%$ reduction in API/tool calls, leading to faster execution, all while maintaining computational efficiency on par with base models. Unlike prior work where memory is treated as static context, we establish the KB as a modular and controllable substrate for agent optimization -- an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.

Paper Structure

This paper contains 70 sections, 8 equations, 6 figures, 5 tables, 5 algorithms.

Figures (6)

  • Figure 1: $\textsc{BREW}$ architecture overview using examples from the OSWorld dataset. Step 1 indicates the trajectory generation process, in which the agent is checked for alignment with human-validated rubrics and for correctness using a task-specific grader. Steps 2--4 indicate the Reflector Agent, which learns key concepts and corresponding insights from trajectories. Step 5 indicates the Integrator Agent, which integrates knowledge from the Reflector Agent to bootstrap the KB. We introduce Expand-and-Gather MCTS for finding the best KB configuration through a reward-guided search.
  • Figure 2: Illustration of $\textsc{BREW}$'s KB optimization process using Expand-and-Gather MCTS with OSWorld examples. In the Expand Phase, for each document $k$, we sample the best node from tree$_k$ using UCT and perform node expansion. Node rewards are estimated based on correctness and retrievability. In the Gather Phase, the current best nodes from each tree are gathered at each node. The process is repeated for the next iteration of KB refinement.
  • Figure 3: The bar plot shows the category-wise success rate of the GTA1 agent across tasks in the OSWorld dataset, whereas the line plot shows the reduction in the number of steps for the successful cases. Note that even in scenarios where the KB doesn't help increase the success rate, it significantly reduces the number of steps needed to succeed.
  • Figure 4: Distribution of errors in $\tau^2$-Bench Retail
  • Figure 5: t-SNE plot of knowledge learned on SpreadsheetBench by $\textsc{BREW}$ (blue), an experiential learning algorithm, and A-mem (yellow), an agentic tool-based memory store that relies on the LLM to take a memory-save action.
  • ...and 1 more figures
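
The Expand and Gather phases described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Node` structure, the `propose` and `reward` callbacks, the exploration constant `c`, and the choice to pick the best node by mean reward are all assumptions made for the example. In particular, `reward` stands in for whatever correctness-and-retrievability estimate the paper uses, and `gather_phase` simply collects the best node of each per-document tree into one KB configuration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate version of a single KB document (hypothetical structure)."""
    text: str
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0


def uct_score(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    return child.mean_reward + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_leaf(root: Node) -> list[Node]:
    """Walk down the tree by UCT; return the path from the root to a leaf."""
    path = [root]
    while path[-1].children:
        parent = path[-1]
        n = max(parent.visits, 1)
        path.append(max(parent.children, key=lambda ch: uct_score(ch, n)))
    return path


def expand_phase(trees: dict, propose, reward) -> None:
    """For each document's tree: select by UCT, expand once, back up the reward."""
    for doc_id, root in trees.items():
        path = select_leaf(root)
        child = Node(propose(path[-1].text))  # assumed refinement step
        path[-1].children.append(child)
        r = reward(doc_id, child.text)        # assumed reward estimate
        for node in path + [child]:
            node.visits += 1
            node.total_reward += r


def best_node(root: Node) -> Node:
    """Return the node with the highest mean reward anywhere in the tree."""
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if node.mean_reward > best.mean_reward:
            best = node
        stack.extend(node.children)
    return best


def gather_phase(trees: dict) -> dict:
    """Collect the current best node of every tree into one KB configuration."""
    return {doc_id: best_node(root).text for doc_id, root in trees.items()}
```

A refinement loop under these assumptions would alternate `expand_phase(trees, propose, reward)` with `gather_phase(trees)`, using the gathered configuration as the KB for the next iteration.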