
Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

Hanchen Li, Runyuan He, Qizheng Zhang, Changxiu Ji, Qiuyang Mang, Xiaokun Chen, Lakshya A Agrawal, Wei-Liang Liao, Eric Yang, Alvin Cheung, James Zou, Kunle Olukotun, Ion Stoica, Joseph E. Gonzalez

Abstract

Recent advances in prompt learning allow large language model agents to acquire task-relevant knowledge from inference-time context without parameter changes. For example, existing methods such as ACE and GEPA can learn system prompts that improve accuracy based on previous agent runs. However, these methods primarily target single-agent or low-parallelism settings, which fundamentally limits their ability to learn efficiently from a large set of collected agentic traces. Running prompt learning in parallel would better accommodate the growing trend of learning from many agentic traces or from parallel agent executions. Yet without a principled scaling strategy, current methods suffer quality degradation at high parallelism. To improve both the efficiency and quality of prompt learning, we propose Combee, a framework for scaling parallel prompt learning in self-improving agents. Combee speeds up learning and enables running many agents in parallel while learning from their aggregate traces without quality degradation. To achieve this, Combee leverages parallel scans and an augmented shuffle mechanism, and introduces a dynamic batch-size controller to balance quality and delay. Evaluations on AppWorld, Terminal-Bench, Formula, and FiNER show that Combee achieves up to a 17x speedup over previous methods with comparable or better accuracy at equivalent cost.
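To make the parallel-scan idea from the abstract concrete, below is a minimal sketch of tree-structured reflection aggregation. This is an illustration under stated assumptions, not the paper's implementation: the `combine` step stands in for what would be an aggregator-LLM call, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def combine(left: str, right: str) -> str:
    """Hypothetical aggregator step: merge two reflections into one.
    In a real system this would be an aggregator-LLM call; here it is
    stubbed as concatenation so the sketch stays self-contained."""
    return left + "\n" + right

def parallel_scan_reduce(reflections: list[str]) -> str:
    """Combine n reflections pairwise in O(log n) parallel rounds,
    so no single aggregation call ever sees all n reflections at once
    (avoiding the context-overload failure of a naive n-way merge)."""
    level = reflections
    with ThreadPoolExecutor() as pool:
        while len(level) > 1:
            pairs = list(zip(level[0::2], level[1::2]))
            merged = list(pool.map(lambda p: combine(*p), pairs))
            if len(level) % 2 == 1:  # odd element carries to the next round
                merged.append(level[-1])
            level = merged
    return level[0]
```

With 64 reflections, this runs six rounds of pairwise merges rather than a single 64-way prompt update, which is the degradation mode Figure 2 describes.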



Figures (8)

  • Figure 1: Summary of improvement snapshot. Combee achieves close-to-optimal quality with significantly reduced training time by increasing the amount of content learned into the prompt under high parallelism. Experiments with DeepSeek-V3.1 on AppWorld.
  • Figure 2: Context overload from naive scaling. As batch size increases, the aggregator LLM produces monotonically fewer and less useful context updates, directly degrading final accuracy across benchmarks.
  • Figure 3: Overall design of Combee (top) vs. naive scaling (bottom). Combee follows a Map-Shuffle-Reduce paradigm: the Map phase dispatches $n$ parallel agents to execute queries and reflect; the Shuffle phase applies augmented shuffling; and the Reduce phase hierarchically combines reflections via parallel scan aggregation. In contrast, naive scaling feeds all reflections directly into a single prompt update, causing context overload. (A hedged code sketch of this Map-Shuffle-Reduce flow appears after this list.)
  • Figure 4: Combee achieves a superior quality-delay trade-off on GEPA for finance benchmarks.
  • Figure 5: Combee achieves a superior quality-delay trade-off on ACE for finance benchmarks.
  • ...and 3 more figures
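As referenced in the Figure 3 caption, the sketch below illustrates the Map-Shuffle-Reduce flow end to end. The agent callable, the shuffling policy, and the reduce step are all placeholder assumptions (a plain random permutation stands in for Combee's augmented shuffling), and the reduce reuses `parallel_scan_reduce` from the earlier sketch.

```python
import random
from typing import Callable

def map_phase(queries: list[str], run_agent: Callable[[str], str]) -> list[str]:
    """Map: dispatch one agent per query; each returns a textual reflection.
    run_agent is a hypothetical callable wrapping agent execution + reflection."""
    return [run_agent(q) for q in queries]

def augmented_shuffle(reflections: list[str], seed: int = 0) -> list[str]:
    """Shuffle: stand-in for Combee's augmented shuffling. A random
    permutation is used here so that each reduce subtree aggregates a
    mixed sample of tasks rather than one contiguous slice."""
    rng = random.Random(seed)
    out = reflections[:]
    rng.shuffle(out)
    return out

def learn_prompt(queries: list[str], run_agent: Callable[[str], str]) -> str:
    """End-to-end Map -> Shuffle -> Reduce pass producing one prompt update,
    using parallel_scan_reduce from the earlier sketch as the Reduce phase."""
    reflections = map_phase(queries, run_agent)
    shuffled = augmented_shuffle(reflections)
    return parallel_scan_reduce(shuffled)
```

The design intuition, per the Figure 3 caption, is that shuffling before the hierarchical reduce spreads task diversity across subtrees, while the tree-structured reduce keeps each aggregation call's context bounded.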