Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence

Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Shuang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Zenglin Xu, Bin Shen, Qifan Wang, Jian Tang, Xiaozhu Ju

TL;DR

Pelican-VL 1.0 introduces an open-source embodied brain model at 7–72B parameters, unified by the Deliberate Practice Policy Optimization (DPPO) framework that couples RL-based skill discovery with supervised consolidation. Through a metaloop of Exploratory Grounding and Targeted Remediation, the approach leverages large-scale, mixed-modal data and a unified preference-learning objective to achieve robust spatial, temporal, and planning capabilities in real-world embodied tasks. Extensive hardware-enabled experiments demonstrate state-of-the-art performance on contact-rich manipulation, affordance-based reasoning, and long-horizon multi-agent planning, while revealing richer, diagnostic benchmarks across nine embodied capability dimensions. By open-sourcing both models and the DPPO toolchain, the work lays a foundation for scalable, self-improving embodied AI and a pathway toward autonomous, real-world robotic intelligence.

Abstract

This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our mission is to embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the deep integration of data power and intelligent adaptive learning mechanisms. Specifically, the metaloop distilled a high-quality dataset from a raw dataset containing more than 4 billion tokens. Pelican-VL 1.0 is trained on a large-scale cluster of 1000+ A800 GPUs, consuming more than 50k A800 GPU-hours per checkpoint. This translates to a 20.3% performance uplift over its base model, and the model outperforms 100B-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks. To train Pelican-VL 1.0, we establish a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition. We operationalize it as a metaloop that teaches the AI to practice deliberately: an RL-Refine-Diagnose-SFT loop.

Paper Structure

This paper contains 65 sections, 22 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Performance comparison of Pelican-VL 1.0. (Left) Comparison against models with $\le$100B parameters. The shaded (pink) region highlights the performance gain over our baseline. (Right) Comparison against models with $>$100B parameters, including leading open-source and proprietary models, where our model also demonstrates SOTA performance.
  • Figure 2: Overview of our training framework. This framework implements an iterative RL-SFT loop that leverages Rollout Logging and Difficulty-Aware Sampling to dynamically curate data. This adaptive data selection process is designed to achieve two complementary objectives: rapid capability enhancement during the RL phase and stable modal alignment during the SFT phase.
  • Figure 3: Overview of the metaloop data selection process.
  • Figure 4: Performance Evolution at Each Stage of DPPO.
  • Figure 5: Distributional Shift of Training Data Relative to Distinct Benchmarks in RL Training. For Where2Place and VSI-Bench-QA, the rewards are numerical and the model’s performance is measured by the average score per rollout, whereas for the other datasets, the rewards are binary and performance is measured by the number of correct answers.
  • ...and 9 more figures
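
The Figure 2 caption mentions Rollout Logging and Difficulty-Aware Sampling, and the Figure 5 caption notes that rewards are either binary or numerical. The following is a minimal Python sketch of how such a selection step might look, not the authors' implementation. The function names, the three bucket labels, and the 0.2/0.8 thresholds are all assumptions made for illustration; none of them appear in the report. The mean reward per prompt serves both reward types: for binary rewards it equals the fraction of correct rollouts, and for numerical rewards it equals the average score per rollout.

```python
from collections import defaultdict
from statistics import mean


def log_rollouts(rollouts):
    """Group rollout rewards by prompt id and return the mean reward per prompt.

    `rollouts` is an iterable of (prompt_id, reward) pairs (hypothetical format).
    """
    rewards = defaultdict(list)
    for prompt_id, reward in rollouts:
        rewards[prompt_id].append(reward)
    return {pid: mean(rs) for pid, rs in rewards.items()}


def split_by_difficulty(mean_reward, low=0.2, high=0.8):
    """Assign each prompt to a bucket based on its mean reward.

    Assumed rule (not from the report):
      mean reward <  low          -> "sft"      (persistent failures)
      low <= mean reward <= high  -> "rl"       (partially solved)
      mean reward >  high         -> "mastered" (already solved)
    """
    buckets = {"sft": [], "rl": [], "mastered": []}
    for pid, score in sorted(mean_reward.items()):
        if score < low:
            buckets["sft"].append(pid)
        elif score <= high:
            buckets["rl"].append(pid)
        else:
            buckets["mastered"].append(pid)
    return buckets


# Toy rollout log with binary rewards: four rollouts per prompt.
rollouts = [
    ("a", 1), ("a", 1), ("a", 1), ("a", 1),
    ("b", 0), ("b", 1), ("b", 0), ("b", 1),
    ("c", 0), ("c", 0), ("c", 0), ("c", 0),
]
scores = log_rollouts(rollouts)
buckets = split_by_difficulty(scores)
```

In this toy log, prompt "a" is always solved, "b" is solved half the time, and "c" is never solved, so each lands in a different bucket. In a real pipeline the thresholds and the routing of each bucket would need to follow the report's actual procedure.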