General learned delegation by clones

Darren Li; Meiqi Chen; Chenze Shao; Fandong Meng; Jie Zhou

TL;DR

The paper tackles the inefficiency of test-time computation in large language models when solving long-context, multi-step tasks. It introduces SELFCEST, a minimal tool-based framework where a root agent spawns shared-weight clones to perform sub-tasks in parallel and series under a fixed budget, trained end-to-end with a global reward $R(\tau)$ and rollout-based credit assignment. Key contributions include a formal joint-MDP setup with shared parameters, GRPO-style advantage, COMA- and difference-reward-inspired baselines, rollout gating for stability, and an analysis of bias versus variance in the learning process; it demonstrates improved accuracy-cost Pareto frontiers on arithmetic, long-context reasoning, and multi-hop QA, with evidence of out-of-domain generalization. The work highlights delegation as a practical inference primitive, suggesting broader implications for compute-aware reasoning and future integration with more sophisticated credit assignment techniques and tighter budget optimization. Practically, SELFCEST offers a path toward more efficient, scalable reasoning by learning how to allocate computation across parallel subcalls while maintaining accuracy under fixed inference budgets.

Abstract

Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which equips a base model with the ability to spawn same-weight clones in separate parallel contexts by agentic reinforcement learning. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at matched inference budget, and exhibits out-of-distribution generalization in both domains.

Paper Structure (30 sections, 14 equations, 4 figures, 6 tables)

This paper contains 30 sections, 14 equations, 4 figures, 6 tables.

Introduction
Related work
Optimizing test-time scaling.
Agentic reinforcement learning.
Credit assignment problem.
Method
Theoretical setup.
GRPO advantage.
COMA-style counterfactual baselines.
Difference rewards.
Rollout gating.
Bias due to rollout gating.
Connection between counterfactual baselines and difference rewards.
Reward function.
Pure arithmetic experiments
...and 15 more sections

Figures (4)

Figure 1: A model trained with SELFCEST acts as a root agent, coordinating helper clones (instantiated on the same parameters) for sub-tasks in parallel and then in series, successfully generalizing outside training data and outperforming various other methods in accuracy and token consumption.
Figure 2: Illustration of one training step of SELFCEST using GRPO. A positive advantage is conferred to roots 0 and 2 (the roots returning correct final answers) and to clones 0, 1, 3, 7, 8, 9, and 10; a negative advantage is conferred to roots 1 and 3, as well as clones 4, 5, 6, and 11; clones 2 and 12 are conferred zero advantage, and their trajectories do not participate in backpropagation. Although the correct computation of Clone 5 ultimately receives a negative advantage, this crude gating is enough to guarantee stability.
Figure 3: SELFCEST solves more problems with fewer tokens.
Figure 4: Comparison of reward between our final training run and the run without rollout gating with the highest final reward.
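
The advantage assignment described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard GRPO group normalization (reward minus group mean, divided by group standard deviation), assumes each clone simply inherits the advantage of the root that spawned it, and takes the gating decision as a given boolean mask, since the gating criterion is not stated here. The names `grpo_advantages`, `clone_advantages`, `clone_parent`, and `clone_kept` are hypothetical, and the toy rewards and clone-to-root mapping are invented for the example.

```python
from statistics import mean, pstdev


def grpo_advantages(root_rewards, eps=1e-8):
    """Group-normalized advantage per root rollout: (R_i - mean) / (std + eps).

    Assumption: population standard deviation over the group; the paper's
    exact normalization is not specified in this summary.
    """
    mu = mean(root_rewards)
    sigma = pstdev(root_rewards)
    return [(r - mu) / (sigma + eps) for r in root_rewards]


def clone_advantages(root_adv, clone_parent, clone_kept):
    """Give each clone its parent root's advantage, or 0.0 if gated out.

    Assumption: clones inherit the root-level advantage unchanged. A clone
    with advantage 0.0 contributes nothing to the policy-gradient loss.
    """
    return [
        root_adv[parent] if kept else 0.0
        for parent, kept in zip(clone_parent, clone_kept)
    ]


# Toy group of four root rollouts; 1.0 marks a correct final answer.
root_rewards = [1.0, 0.0, 1.0, 0.0]
root_adv = grpo_advantages(root_rewards)

# Hypothetical clones: index of the spawning root, and whether gating keeps them.
clone_parent = [0, 0, 1, 2]
clone_kept = [True, False, True, True]
clone_adv = clone_advantages(root_adv, clone_parent, clone_kept)
```

In this toy group, roots 0 and 2 receive a positive advantage and roots 1 and 3 a negative one. The second clone is gated out and gets zero, while the third clone receives a negative advantage purely because its root failed, which mirrors the Clone 5 situation in the caption.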
