Discover, Learn, and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

Rushuai Yang, Zhiyuan Feng, Tianxiang Zhang, Kaixin Wang, Chuheng Zhang, Li Zhao, Xiu Su, Yi Chen, Jiang Bian

TL;DR

The paper tackles the data scarcity and limited diversity problem in vision-language-action pretraining by introducing Discover, Learn, and Reinforce (DLR), a three-stage framework that generates a diverse set of high-success robotic trajectories through pattern discovery, pattern-conditioned behavior cloning, and pattern-specific online reinforcement. By decoupling diversity from the main exploration process and anchoring it to the successful state manifold, DLR mitigates exploration penalties common to mutual-information-based approaches and preserves multiple high-quality behavioral patterns. The authors provide theoretical guarantees showing how pattern diversity is preserved under reasonable assumptions and demonstrate empirically on LIBERO that DLR yields richer trajectory distributions than standard offline-to-online RL baselines, translating into superior downstream VLA generalization and positive data-scaling trends. The work supports a shift toward algorithmic, multi-pattern data generation as a scalable, cost-effective foundation for embodied foundation models, with potential for automated task and environment generation to further scale pretraining data.

Abstract

Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Learn, and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.

Paper Structure

This paper contains 42 sections, 6 theorems, 39 equations, 16 figures, 2 tables.

Key Result

Theorem 1

Under assumptions (i)--(iv), for every pattern $j$ and all Stage 3 iterates $t$, Since $R(\tau)=0$ on $\mathcal{T}_0$, the expected ascent direction is dominated by $\mathcal{T}_j^+$, so PPO updates remain localized and converge to a local optimum within $\mathcal{T}_j^+$.

Figures (16)

  • Figure 1: Comparison between our DLR framework and a standard offline-to-online RL baseline. The top row illustrates our three-stage DLR process: (1) We discover latent patterns from human data using a VAE-based approach. (2) We learn a pattern-conditioned policy via behavior cloning on the now-labeled data. (3) We refine each pattern-conditioned policy online with a sparse success reward. This process results in a diverse, multi-modal state visitation distribution. The bottom row shows a standard offline-to-online RL baseline: (1) A policy is initialized via behavior cloning on the entire unlabeled human dataset. (2) The policy is refined online with a sparse success reward. This standard approach leads to mode collapse, resulting in a uni-modal state visitation distribution.
  • Figure 2: The Learning Process of DLR. Each panel represents the same state space $\mathcal{S}$. The amorphous regions depict areas of visited states under different policies, and the black dots represent the corresponding state visitation distributions induced by the policy. (a) Given suboptimal human demonstrations ($S_{\text{human}}$, gray) that cover multiple distinct successful strategies, our goal is to learn several optimal policies with high task success rates and high pattern diversity from an initial, randomly initialized policy distribution $d^{\pi_\text{init}}$. (b) In Stage 1, we discover underlying behavioral patterns from $S_{\text{human}}$ by clustering the states into distinct modes, each identified by a latent code ($z_1$, $z_2$, $z_3$) and visualized with a different color (blue, purple, yellow). (c) In Stage 2, we use behavior cloning to train a conditional policy $\pi(\cdot|z)$ that imitates each discovered mode. The dashed arrows indicate the cloning process, mapping from a general initial policy to more specialized ones. (d) In Stage 3, each pattern-conditioned policy is fine-tuned with sparse task rewards, letting each one converge to its own optimal version, $d_k^*$.
  • Figure 3: Environment Setting and Data Generation Pipeline. (a) For each task, we train a lightweight RL policy using our DLR framework, then collect high-quality trajectories via policy rollouts. Each color-coded database represents trajectories associated with a distinct behavioral pattern discovered by DLR. We combine data from all tasks for VLA pretraining. (b, c) We use SFT to pretrain variants of VLA architectures on the RL-generated data, then employ the pretrained VLA models for fast adaptation to unseen downstream tasks that the models have never encountered during pretraining.
  • Figure 4: Qualitative visualizations of different behaviors discovered in manipulation tasks. We show that DLR is able to explore diverse behavior patterns under the same initial state. Left: Pick up the book and place it into the caddy. The robot needs to adjust the book’s orientation to fit into the upright caddy. DLR learns two strategies: rotating the book clockwise or counterclockwise to align it properly. Middle: Open the stove. The robot learns to open the door using different contact strategies, including pulling with the effector's edge followed by a center push, versus consistently pulling with the end-effector. Right: Close the stove. A cup blocks the door, requiring obstacle-aware coordination. The robot learns two distinct behaviors, either picking up the cup and placing it backward before closing, or pushing the cup forward to clear the path before closing the stove.
  • Figure 5: Trajectory visualizations. The task is to close the bottom drawer of the cabinet; the dot indicates the initial position. (a) Standard single-pattern RL converges to a single dominant strategy with limited variation. (b) DLR discovers trajectories with higher variance. DTW refers to the Dynamic Time Warping distance between the two trajectories.
  • ...and 11 more figures
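
The Figure 5 caption reports a Dynamic Time Warping (DTW) distance between two trajectories. The summary does not say which features, point cost, or normalization the authors use, so the following is only a minimal sketch of the textbook DTW recurrence. It assumes each trajectory is a sequence of equal-dimension points (for example, hypothetical end-effector positions) and uses a Euclidean point cost; the function name `dtw_distance` is illustrative, not taken from the paper.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two point sequences (assumed Euclidean point cost)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both trajectories must be non-empty")
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: advance in a, advance in b, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # Hypothetical 2D paths: same shape at different speeds, then a shifted copy.
    slow = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    fast = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    print(dtw_distance(slow, fast))
    print(dtw_distance(fast, shifted))
```

Because DTW aligns sequences before summing point costs, two rollouts that follow the same path at different speeds score as close, while rollouts that take geometrically different routes score as far apart. That property is what makes it a plausible measure of strategy-level diversity rather than timing differences.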

Theorems & Definitions (11)

  • Theorem 1: Pattern preservation with failure moat and KL-to-init
  • Lemma 1: Trajectory-level policy-gradient identity
  • proof
  • Lemma 2: Pinsker
  • Lemma 3: Leakage growth under a KL-to-init budget
  • proof
  • Lemma 4: Cross-pattern gradient bound
  • proof
  • Lemma 5: Failure term does not create positive ascent
  • proof
  • ...and 1 more
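
For reference, Lemma 2 is listed only by name. The standard form of Pinsker's inequality is given below; the paper's own notation and normalization are not shown in this summary, so the symbols $P$, $Q$, and $\mathrm{TV}$ here are assumptions. For probability distributions $P$ and $Q$ on the same space, with the KL divergence measured in nats:

$$
\mathrm{TV}(P, Q) \;=\; \sup_{A} \bigl| P(A) - Q(A) \bigr| \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}.
$$

Read this way, a bound on the KL divergence between a fine-tuned policy's trajectory distribution and its initialization also bounds how much probability mass can move between any two sets of trajectories, which is presumably how a KL-to-init budget enters the leakage bound of Lemma 3.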