Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Yingji Zhang, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Haozhe Shan, Junbo Qi, Yan Bai, Dengjie Li, Jiachen Luo, Yidong Wang, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju

TL;DR

DPPO introduces a metacognitive training loop that alternates reinforcement-learning-based weakness discovery with targeted supervised fine-tuning to overcome data and compute bottlenecks in embodied vision-language systems. By unifying RL and SFT under a preference-learning framework, the approach automatically identifies failure modes, allocates learning resources to hard cases, and consolidates improvements with diverse data sources. Empirically, Pelican-VL 1.0 (72B) achieves a 20.3% uplift over its base model and outperforms open-source 100B-scale models by 10.6%, while maintaining general-domain performance and reducing forgetting. The open-source release and a diagnostics-focused capability taxonomy further provide a practical, scalable path for building versatile, self-improving embodied agents.

Abstract

Developing a universal and versatile embodied intelligence system presents two primary challenges: the critical embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which are resource-prohibitive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This enables automatic weakness identification and targeted resource allocation, specifically designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
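
The alternation described in the abstract can be pictured as a simple control loop. The sketch below is a toy illustration only, not the released DPPO code: the "model" is just a dictionary of per-task success probabilities, and every name in it (`rl_phase`, `find_weaknesses`, `sft_phase`, `metaloop`, `threshold`, `step`) is hypothetical. In particular, the real RL phase also updates the policy, which this toy version omits; here it only logs rollouts.

```python
import random


def rl_phase(skill, n_rollouts, rng):
    """Toy RL phase: roll out every task and log its empirical success rate."""
    log = {}
    for task, p_success in skill.items():
        wins = sum(rng.random() < p_success for _ in range(n_rollouts))
        log[task] = wins / n_rollouts
    return log


def find_weaknesses(log, threshold):
    """Tasks whose logged success rate falls below the threshold."""
    return [task for task, rate in log.items() if rate < threshold]


def sft_phase(skill, weak_tasks, step):
    """Toy stand-in for targeted SFT: raise skill on the weak tasks only."""
    for task in weak_tasks:
        skill[task] = min(1.0, skill[task] + step)


def metaloop(skill, cycles=3, n_rollouts=20, threshold=0.6, step=0.2, seed=0):
    """Alternate weakness discovery (RL rollouts) and targeted consolidation (SFT)."""
    rng = random.Random(seed)
    history = []
    for _ in range(cycles):
        log = rl_phase(skill, n_rollouts, rng)
        weak = find_weaknesses(log, threshold)
        sft_phase(skill, weak, step)
        history.append(weak)
    return history
```

Calling `metaloop({"grasp": 0.3, "navigate": 0.9})` returns, per cycle, the list of tasks that were flagged as weak and routed to the SFT stand-in; the threshold and step size are arbitrary illustrative values.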

Paper Structure

This paper contains 39 sections, 12 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of DPPO. The framework implements an iterative RL–SFT metaloop that leverages rollout logging and difficulty-aware sampling for dynamic data curation. This adaptive process alternates between revealing weaknesses in the RL phase and refining them in the SFT phase, forming a continual self-diagnosis and self-refinement cycle.
  • Figure 2: Performance evolution across the stages of Pelican-VL 72B. The model exhibits continuous improvements on embodied benchmarks while maintaining stable results on general datasets, as demonstrated by its consistent performance on MVBench, a general-domain benchmark (Finding 1).
  • Figure 3: Distributional shift of training data relative to distinct benchmarks in (1-1) RL training on the Pelican-VL 7B model. For Where2Place, the rewards are numerical and model performance is measured by the average score per rollout. For the other datasets, the rewards are binary and performance is measured by the number of correct answers. Progressively darker line colors indicate later stages of RL training, over which we observe a steady reduction in unlearned tasks and a corresponding increase in successfully solved tasks (Finding 2).
  • Figure 4: Comparison of SFT, RL, and DPPO on the 7B model in terms of performance gain on VSI-Bench and forgetting on general benchmarks. The results show a substantial performance gain of 54.3, while forgetting under DPPO remains limited: on the MMStar dataset, the degradation is 1.9 for DPPO, compared with 5.0 for SFT and 24.8 for RL (Finding 3).
  • Figure 5: Distributional shift of the Pelican-VL 7B model's trajectory embedding centroids across DPPO metaloop cycles, visualized via t-SNE on five benchmarks. The figure tracks how the model's representations evolve over alternating training steps, producing distinct trajectories for each benchmark. Visualization details are provided in Sec. \ref{part:distshift}. The divergent trajectories across benchmarks illustrate that different tasks shift the model in different representational directions, making a single training stage insufficient and motivating our multi-stage DPPO metaloop design (Finding 5).
  • ...and 2 more figures
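
The captions for Figures 1 and 3 mention rollout logging, difficulty-aware sampling, and a split between unlearned and solved tasks. One way such a step could look is sketched below. This is an assumption-laden illustration, not the authors' procedure: the bucket boundaries (`low`, `high`), the failure-rate weighting rule, and the function names `bucket_by_difficulty` and `sampling_weights` are all hypothetical.

```python
def bucket_by_difficulty(success_rates, low=0.0, high=1.0):
    """Split logged samples into unlearned / partial / solved by success rate.

    `success_rates` maps a sample id to the fraction of rollouts that succeeded.
    The boundaries `low` and `high` are illustrative defaults.
    """
    buckets = {"unlearned": [], "partial": [], "solved": []}
    for sample_id, rate in success_rates.items():
        if rate <= low:
            buckets["unlearned"].append(sample_id)
        elif rate >= high:
            buckets["solved"].append(sample_id)
        else:
            buckets["partial"].append(sample_id)
    return buckets


def sampling_weights(success_rates):
    """Weight each sample by its failure rate (1 - success rate), normalized to sum to 1.

    Falls back to uniform weights when every sample is already fully solved.
    """
    raw = {sid: 1.0 - rate for sid, rate in success_rates.items()}
    total = sum(raw.values())
    if total == 0:
        n = len(raw)
        return {sid: 1.0 / n for sid in raw}
    return {sid: w / total for sid, w in raw.items()}
```

Under this weighting, samples the model always solves receive zero weight and drop out of the next round, while samples it never solves receive the most attention; a real pipeline might instead route never-solved samples to SFT and keep partially solved ones for RL.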