General learned delegation by clones
Darren Li, Meiqi Chen, Chenze Shao, Fandong Meng, Jie Zhou
TL;DR
The paper addresses the compute inefficiency of test-time reasoning in large language models on long-context, multi-step tasks. It introduces SELFCEST, a minimal tool-based framework in which a root agent spawns shared-weight clones that carry out sub-tasks in parallel and in series under a fixed budget. The system is trained end-to-end with a global reward $R(\tau)$ and rollout-based credit assignment. Key contributions include a formal joint-MDP setup with shared parameters, a GRPO-style advantage, baselines inspired by COMA and difference rewards, rollout gating for stability, and an analysis of the bias-variance trade-off in learning. SELFCEST improves the accuracy-cost Pareto frontier on arithmetic, long-context reasoning, and multi-hop QA, with evidence of out-of-domain generalization. The work positions delegation as a practical inference primitive: the model learns how to allocate computation across parallel subcalls while maintaining accuracy under a fixed inference budget. This has implications for compute-aware reasoning and leaves room for future integration with more sophisticated credit assignment techniques and tighter budget optimization.
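As a rough illustration of what a GRPO-style advantage over a global reward $R(\tau)$ can look like, the sketch below normalizes the rewards of a group of rollouts sampled for the same prompt. It assumes the standard group-relative form (subtract the group mean, divide by the group standard deviation). The summary does not state the paper's exact normalization, baselines, or gating rule, so the function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per rollout, normalized within the group.

    rewards: global task rewards R(tau), one per rollout of the same prompt.
    eps:     small constant (an assumption here) that avoids division by
             zero when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards only: four rollouts, two of which solved the task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

In a shared-parameter setting, one natural choice is to apply the same trajectory-level advantage to the tokens produced by the root agent and by its clones. That choice is an assumption of this sketch, not a detail confirmed by the summary.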
Abstract
Frontier language models improve with additional test-time computation, but serial reasoning or uncoordinated parallel sampling can be compute-inefficient under fixed inference budgets. We propose SELFCEST, which uses agentic reinforcement learning to equip a base model with the ability to spawn same-weight clones in separate parallel contexts. Training is end-to-end under a global task reward with shared-parameter rollouts, yielding a learned controller that allocates both generation and context budget across branches. Across challenging math reasoning benchmarks and long-context multi-hop QA, SELFCEST improves the accuracy-cost Pareto frontier relative to monolithic baselines at a matched inference budget, and exhibits out-of-distribution generalization in both domains.
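To make the notion of an accuracy-cost Pareto frontier concrete, the following sketch extracts the non-dominated configurations from a list of (cost, accuracy) pairs. A configuration is kept only if no other configuration is at least as cheap and at least as accurate. The function name `pareto_frontier` and the numbers in the example are hypothetical and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (cost, accuracy) pairs, sorted by cost.

    points: iterable of (cost, accuracy) tuples, where lower cost and
            higher accuracy are both preferred.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by increasing cost; for equal cost, put the most accurate first.
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every cheaper (or equally cheap) one.
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Made-up (token budget, accuracy) pairs, purely for illustration.
runs = [(100, 0.50), (200, 0.60), (150, 0.70), (300, 0.65)]
print(pareto_frontier(runs))
```

One method improves the frontier over another when, at a matched cost, its frontier reaches higher accuracy.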
