Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories
Rushuai Yang, Zhiyuan Feng, Tianxiang Zhang, Kaixin Wang, Chuheng Zhang, Li Zhao, Xiu Su, Yi Chen, Jiang Bian
TL;DR
The paper tackles data scarcity and limited diversity in vision-language-action pretraining by introducing Discover, Learn, and Reinforce (DLR), a three-stage framework that generates a diverse set of high-success robotic trajectories through pattern discovery, pattern-conditioned behavior cloning, and pattern-specific online reinforcement. By decoupling diversity from the main exploration process and anchoring it to the successful state manifold, DLR mitigates exploration penalties common to mutual-information-based approaches and preserves multiple high-quality behavioral patterns. The authors provide theoretical guarantees showing how pattern diversity is preserved under reasonable assumptions and demonstrate empirically on LIBERO that DLR yields richer trajectory distributions than standard offline-to-online RL baselines, translating into superior downstream VLA generalization and positive data-scaling trends. The work supports a shift toward algorithmic, multi-pattern data generation as a scalable, cost-effective foundation for embodied foundation models, with potential for automated task and environment generation to further scale pretraining data.
Abstract
Scaling vision-language-action (VLA) model pretraining requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pretraining. We propose Discover, Learn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.
