General learned delegation by clones

Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou

TL;DR

The paper tackles the inefficiency of test-time computation in large language models when solving long-context, multi-step tasks. It introduces SELFCEST, a minimal tool-based framework where a root agent spawns shared-weight clones to perform sub-tasks in parallel and series under a fixed budget, trained end-to-end with a global reward $R(\tau)$ and rollout-based credit assignment. Key contributions include a formal joint-MDP setup with shared parameters, GRPO-style advantage, COMA- and difference-reward-inspired baselines, rollout gating for stability, and an analysis of bias versus variance in the learning process; it demonstrates improved accuracy-cost Pareto frontiers on arithmetic, long-context reasoning, and multi-hop QA, with evidence of out-of-domain generalization. The work highlights delegation as a practical inference primitive, suggesting broader implications for compute-aware reasoning and future integration with more sophisticated credit assignment techniques and tighter budget optimization. Practically, SELFCEST offers a path toward more efficient, scalable reasoning by learning how to allocate computation across parallel subcalls while maintaining accuracy under fixed inference budgets.

Abstract

Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.

Paper Structure (30 sections, 14 equations, 4 figures, 6 tables)

This paper contains 30 sections, 14 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: A model trained with SELFCEST acts as a root agent, coordinating helper clones (instantiated on the same parameters) for sub-tasks in parallel and then in series, successfully generalizing outside training data and outperforming various other methods in accuracy and token consumption.
  • Figure 2: Illustration of one training step of SELFCEST using GRPO. A positive advantage is conferred to roots 0 and 2 -- the roots returning correct final answers -- and clones 0, 1, 3, 7, 8, 9, and 10; a negative advantage is conferred to roots 1 and 3, as well as clones 4, 5, 6, and 11; clones 2 and 12 are conferred zero advantage and their trajectories do not participate in backpropagation. Although Clone 5 computed its sub-task correctly, it still receives a negative advantage; even so, this crude gating is enough to guarantee stability.
  • Figure 3: SELFCEST solves more problems with fewer tokens.
  • Figure 4: Comparison of reward between our final training run and the run trained without rollout gating that reached the highest final reward.
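
The advantage assignment described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes the common GRPO form of group-normalized advantages (reward minus group mean, divided by group standard deviation), assumes each clone simply inherits the advantage of the root that spawned it, and treats the gating decision as a given set of clone ids, since the caption does not state the gating criterion. The function names, the `clone_parent` mapping, and the specific clone-to-root assignment in the usage example are all hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed GRPO-style): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def assign_advantages(root_rewards, clone_parent, gated_clones):
    """Propagate each root's advantage to its clones, zeroing out gated clones.

    root_rewards : list of global task rewards, one per root rollout
    clone_parent : dict mapping clone id -> index of the root that spawned it
                   (hypothetical bookkeeping structure)
    gated_clones : set of clone ids excluded by rollout gating
                   (the criterion itself is not modeled here)
    """
    root_adv = group_advantages(root_rewards)
    clone_adv = {
        clone_id: 0.0 if clone_id in gated_clones else root_adv[root_id]
        for clone_id, root_id in clone_parent.items()
    }
    # Only non-gated clone trajectories would contribute to the gradient.
    trainable = sorted(c for c in clone_adv if c not in gated_clones)
    return root_adv, clone_adv, trainable


# Usage: four roots, with roots 0 and 2 answering correctly (reward 1).
# The clone-to-root mapping below is invented to be consistent with the
# signs listed in the caption; the actual figure may group clones differently.
root_rewards = [1.0, 0.0, 1.0, 0.0]
clone_parent = {0: 0, 1: 0, 2: 0, 3: 0,
                4: 1, 5: 1, 6: 1,
                7: 2, 8: 2, 9: 2, 10: 2,
                11: 3, 12: 3}
gated_clones = {2, 12}

root_adv, clone_adv, trainable = assign_advantages(
    root_rewards, clone_parent, gated_clones
)
```

Under these assumptions, clone 5 receives a negative advantage purely because its root answered incorrectly, regardless of whether its own sub-task was solved correctly, which is the coarse credit assignment the caption points out.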