Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
Brian Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
TL;DR
This work tackles the inefficiency of on-policy RL for large language model post-training by introducing Trajectory Balance with Asynchrony (TBA), a distributed off-policy RL framework that decouples exploration from learning. By employing a trajectory balance objective with VarGrad and a Searcher-Trainer architecture, TBA leverages large replay buffers to learn from diverse, off-policy data in parallel, achieving substantial speedups (up to 50x) while maintaining or surpassing baseline performance across mathematical reasoning, preference-tuning, and automated red-teaming tasks. The paper also explores scalability via TBA', a simplified variant suitable for larger models, and analyzes the tradeoffs of off-policy data through the recency/reward sampling parameter m. Overall, TBA demonstrates robust, scalable, and efficient LLM post-training, enabling faster deployment and broader exploration for alignment and safety objectives.
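To make the "trajectory balance objective with VarGrad" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a KL-regularized log-reward of the form log pi_ref(y|x) + r(y, x) / beta, and it estimates the log-partition term for a prompt as the batch mean over that prompt's sampled responses. The function name `tb_vargrad_loss`, its argument names, and the temperature `beta` are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def tb_vargrad_loss(logp_policy, logp_ref, rewards, beta):
    """Sketch of a trajectory-balance loss with a VarGrad-style log Z estimate.

    All inputs describe K responses sampled for ONE prompt:
      logp_policy[i]: log-probability of response i under the policy being trained
      logp_ref[i]:    log-probability of response i under the reference model
      rewards[i]:     scalar reward of response i
      beta:           assumed KL-regularization temperature (hypothetical name)
    """
    # Per-response estimate of log Z under the assumed log-reward form.
    log_z_samples = [
        lr + r / beta - lp
        for lp, lr, r in zip(logp_policy, logp_ref, rewards)
    ]
    # VarGrad-style estimate: average the per-response values over the batch.
    log_z_hat = fmean(log_z_samples)
    # TB residual per response: log Z + log pi_theta - log pi_ref - r / beta.
    residuals = [log_z_hat - s for s in log_z_samples]
    # The mean squared residual equals the batch variance of log_z_samples.
    return fmean([d * d for d in residuals])


# Responses that already agree on log Z give zero loss.
print(tb_vargrad_loss([-1.0, -2.0], [-1.5, -1.5], [1.0, 0.0], beta=1.0))
```

Because the loss depends only on log-probabilities and rewards of stored responses, it can in principle be evaluated on responses that were not generated by the current policy, which is the property that lets a replay buffer feed training.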
Abstract
Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, the on-policy algorithms used for post-training are not naturally robust to the diverse contents of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel with training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding that TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show that its reward- and recency-prioritizing sampling enables further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.
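The reward- and recency-prioritizing sampling can be illustrated with a small sketch. It assumes, without confirmation from the text above, that the parameter m is the probability of drawing the most recently added buffer entries, and that the highest-reward entries are drawn otherwise. The buffer layout, the function name `sample_batch`, and the hard top-k selection are hypothetical simplifications, not the repository's API.

```python
import random


def sample_batch(buffer, batch_size, m, rng=random):
    """Pick a training batch from a replay buffer (illustrative only).

    buffer: list of (step_added, reward, trajectory) tuples, a hypothetical layout
    m:      assumed probability of prioritizing recency over reward
    """
    if rng.random() < m:
        # Recency-prioritized: prefer the most recently added entries.
        ranked = sorted(buffer, key=lambda item: item[0], reverse=True)
    else:
        # Reward-prioritized: prefer the highest-reward entries.
        ranked = sorted(buffer, key=lambda item: item[1], reverse=True)
    return ranked[:batch_size]


buffer = [(0, 0.9, "a"), (1, 0.1, "b"), (2, 0.5, "c")]
print(sample_batch(buffer, 2, m=1.0))  # always takes the recency branch
print(sample_batch(buffer, 2, m=0.0))  # always takes the reward branch
```

Under these assumptions, a larger m keeps training closer to on-policy data, while a smaller m makes more use of high-reward experience found earlier in the search.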
