Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization

Yi Zhang; Che Liu; Xiancong Ren; Hanchu Ni; Yingji Zhang; Shuai Zhang; Zeyuan Ding; Jiayu Hu; Haozhe Shan; Junbo Qi; Yan Bai; Dengjie Li; Jiachen Luo; Yidong Wang; Yong Dai; Zenglin Xu; Bin Shen; Qifan Wang; Jian Tang; Xiaozhu Ju

TL;DR

DPPO introduces a metacognitive training loop that alternates reinforcement-learning-based weakness discovery with targeted supervised fine-tuning to overcome the data and compute bottlenecks in embodied vision-language systems. By unifying RL and SFT under a single preference-learning framework, the approach automatically identifies failure modes, allocates learning resources to hard cases, and consolidates improvements with diverse data sources. Empirically, Pelican-VL 1.0 (72B) achieves a 20.3% improvement over its base model and outperforms open-source 100B-scale models by 10.6%, while maintaining general-domain performance and reducing forgetting. The open-source release and a diagnostics-focused capability taxonomy offer a practical, scalable path toward versatile, self-improving embodied agents.

Abstract

Developing a universal and versatile embodied intelligence system presents two primary challenges: the embodied data bottleneck, where real-world data is scarce and expensive, and the algorithmic inefficiency of existing methods, which makes them prohibitively resource-intensive. To address these limitations, we introduce Deliberate Practice Policy Optimization (DPPO), a metacognitive "Metaloop" training framework that dynamically alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). This alternation enables automatic weakness identification and targeted resource allocation, and is designed to maximize learning efficiency from sparse, finite data. Theoretically, DPPO can be formalised as a unified preference-learning framework. Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model and surpasses open-source models at the 100B-parameter scale by 10.6%. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck and enables the community to build versatile embodied agents efficiently.
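
The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function `metaloop`, the per-task success probabilities standing in for a policy, the 0.5 failure threshold, and the fixed 0.2 increment standing in for a fine-tuning update are all assumptions made for illustration. It shows only the control flow: roll out, flag weak tasks, then spend the update budget on those tasks.

```python
import random


def metaloop(skills, tasks, rounds=3, rollouts=8, seed=0):
    """Toy alternation of weakness discovery and targeted updates.

    skills: dict mapping task name -> success probability. This is a
    hypothetical stand-in for a policy, not a real model.
    """
    rng = random.Random(seed)
    skills = dict(skills)  # work on a copy so the caller's dict is untouched
    history = []
    for _ in range(rounds):
        # Discovery phase (stand-in for RL rollouts): estimate the success
        # rate of the current "policy" on every task.
        success = {
            t: sum(rng.random() < skills[t] for _ in range(rollouts)) / rollouts
            for t in tasks
        }
        # Weakness identification: keep tasks failed more than half the time.
        # The 0.5 threshold is an assumption of this sketch.
        hard = [t for t in tasks if success[t] < 0.5]
        # Consolidation phase (stand-in for targeted SFT): only the hard
        # tasks receive an update. The 0.2 step is an assumption.
        for t in hard:
            skills[t] = min(1.0, skills[t] + 0.2)
        history.append(hard)
    return skills, history


final, history = metaloop({"easy": 1.0, "hard": 0.0}, ["easy", "hard"])
print(final, history)
```

In this toy setting the task that always succeeds is never selected for an update, while the task that always fails is selected in the first round, which is the targeted-allocation behaviour the abstract describes at a high level.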
