Universal Pose Pretraining for Generalizable Vision-Language-Action Policies

Haitao Lin; Hanyang Yu; Jingshun Huang; He Zhang; Yonggen Ling; Ping Tan; Xiangyang Xue; Yanwei Fu

TL;DR

Pose-VLA introduces a two-stage framework that decouples Vision-Language-Action learning into universal 3D spatial pretraining in a camera-centric space and subsequent embodiment alignment. By representing objects and trajectories with discrete pose tokens and integrating RGB-D with camera intrinsics, the model learns robust geometric priors that transfer efficiently to robotic control with limited demonstrations. The approach achieves state-of-the-art 3D grounding on Objectron and strong results on RoboTwin 2.0 and LIBERO, while real-world experiments demonstrate practical data efficiency (~100 demos/task) and improved generalization across rigid, articulated, and deformable objects. This work advocates shifting VLA pretraining toward embodiment-aware, geometry-grounded foundations to enable scalable, generalizable robotic manipulation.
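To make the phrase "camera-centric space" concrete, the sketch below lifts a pixel with a depth reading into a 3D point in the camera frame using the standard pinhole model. This is a minimal illustration, not the paper's pipeline: the function names and the intrinsics values (`FX`, `FY`, `CX`, `CY`) are hypothetical, chosen only for the example.

```python
from typing import Tuple


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(x: float, y: float, z: float,
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-frame 3D point back to pixel coordinates (inverse of backproject)."""
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 camera (assumed values, not from the paper).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A pixel 60 px right of the principal point, observed at 2 m depth.
point = backproject(380.0, 240.0, 2.0, FX, FY, CX, CY)
pixel = project(*point, FX, FY, CX, CY)
```

Because the point is expressed relative to the camera rather than to a robot base, the same representation applies to any dataset that provides depth and intrinsics, which is the property the two-stage design relies on.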

Abstract

Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency because they entangle high-level perception with sparse, embodiment-specific action supervision. Since these models typically rely on VLM backbones optimized for Visual Question Answering (VQA), they excel at semantic identification but often overlook subtle 3D state variations that dictate distinct action patterns. To resolve these misalignments, we propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors in a unified camera-centric space, and a post-training phase for efficient embodiment alignment within a robot-specific action space. By introducing discrete pose tokens as a universal representation, Pose-VLA seamlessly integrates spatial grounding from diverse 3D datasets with geometry-level trajectories from robotic demonstrations. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment through trajectory supervision. Extensive evaluations demonstrate that Pose-VLA achieves state-of-the-art results on RoboTwin 2.0 with a 79.5% average success rate and competitive performance on LIBERO at 96.0%. Real-world experiments further showcase robust generalization across diverse objects using only 100 demonstrations per task, validating the efficiency of our pre-training paradigm.
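The abstract does not spell out how a continuous pose becomes a discrete token. One common way to do this, shown here purely as an assumed illustration, is uniform binning: each pose component is clipped to a fixed range and mapped to a bin index. The bin count, the value range, and the names `encode` and `decode` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

NUM_BINS = 256          # hypothetical number of bins per pose dimension
LOW, HIGH = -1.0, 1.0   # hypothetical value range (e.g. metres in the camera frame)


def encode(values: Sequence[float]) -> List[int]:
    """Map each continuous value to a bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for v in values:
        clipped = min(max(v, LOW), HIGH)          # out-of-range values saturate
        frac = (clipped - LOW) / (HIGH - LOW)     # normalise to [0, 1]
        tokens.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return tokens


def decode(tokens: Sequence[int]) -> List[float]:
    """Map each bin index back to the centre of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


translation = [0.10, -0.25, 0.80]   # example camera-frame translation
tokens = encode(translation)
recovered = decode(tokens)
```

Under this scheme the round-trip error is bounded by half a bin width, so the bin count trades vocabulary size against geometric precision.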

Paper Structure (19 sections, 3 equations, 7 figures, 6 tables)

Introduction
Related Work
Method
Preliminaries
VLM architecture
Unified Pose Representation
Adding Prior Conditioning
Training Strategy
Pre-training Datasets
Experiment
Evaluation in 3D Grounding Benchmarks
Evaluation in Simulation Benchmarks
Evaluation in Real-world Tasks
Ablation Study
Conclusion
...and 4 more sections

Figures (7)

Figure 1: Overview of Pose-VLA. Unlike previous VLAs that rely solely on sparse action supervision, our approach decouples policy learning into two stages using a unified pose token: (1) Pre-training, which extracts universal 3D spatial priors in a unified camera-centric space; and (2) Alignment, which adapts these priors to specific embodiments. This decoupling allows the model to leverage diverse 3D datasets and to serve as an efficient backbone that adapts to robotic control with only few-shot fine-tuning.
Figure 2: Pipeline of Pose-VLA. Pose-VLA decouples VLA training into: (1) Pre-training for extracting universal 3D spatial priors in a camera-centric space, and (2) Post-training for embodiment alignment. The VLM predicts a structured sequence $\mathcal{S} = (\tau_1, \dots, \tau_T)$ via next-token prediction, where each tuple $\tau_t = \{\mathbf{c}_t, \mathbf{b}_t, \mathbf{p}_t\}$ consists of a category $\mathbf{c}_t$, 2D box center $\mathbf{b}_t$, and camera-centric pose $\mathbf{p}_t$. To enhance spatial reasoning, auxiliary 3D geometry priors are integrated via additive fusion with RGB embeddings, analogous to positional encodings. This unified format enables seamless knowledge transfer from diverse 3D datasets to robotic domains, achieving robust alignment with minimal demonstrations.
Figure 3: Generalization of 3D spatial grounding across unseen scenarios. Pose-VLA exhibits robust generalization across various unseen settings, ranging from indoor tabletop layouts to complex robotic manipulation workspaces, providing more precise geometric localization than baseline methods.
Figure 4: Real-world setup of four representative tasks. Our platform uses a dual-arm Xtrainer with head and wrist cameras. The benchmark includes: (1) Tableware Arrangement, (2) Hanging a mug, (3) Long-horizon drawer interaction, and (4) Deformable towel folding. Success rates are evaluated over 20 trials per task.
Figure 5: Success rate comparison of Pose-VLA and baseline models across four real-world manipulation tasks. Each model is evaluated over 20 trials per task, with success rates reported as percentages. Under the same demonstration scale, Pose-VLA consistently outperforms current vision-language-action baselines, especially in long-horizon and deformable object tasks.
...and 2 more figures
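The Figure 2 caption describes geometry priors being added to RGB embeddings in the manner of positional encodings. A minimal sketch of that additive fusion follows, assuming one 3D point per image patch and a single linear projection into the embedding dimension. The function names, the weight matrix, and the toy values are all hypothetical; the paper's actual fusion module may differ.

```python
from typing import List

Vector = List[float]


def project_geometry(point: Vector, weight: List[Vector]) -> Vector:
    """Linearly map a 3D point (x, y, z) to the embedding dimension.

    `weight` is a 3 x D matrix (hypothetical learned parameters).
    """
    dim = len(weight[0])
    return [sum(point[i] * weight[i][d] for i in range(3)) for d in range(dim)]


def fuse(rgb_tokens: List[Vector], points: List[Vector],
         weight: List[Vector]) -> List[Vector]:
    """Add a projected geometry embedding to each RGB patch embedding."""
    fused = []
    for token, point in zip(rgb_tokens, points):
        geo = project_geometry(point, weight)
        fused.append([t + g for t, g in zip(token, geo)])
    return fused


# Toy example: two patches, embedding dimension 2 (assumed values).
rgb_tokens = [[1.0, 2.0], [3.0, 4.0]]
points = [[0.0, 0.0, 1.0], [1.0, 0.0, 2.0]]     # per-patch camera-frame points
weight = [[0.5, 0.0], [0.0, 0.5], [0.0, 1.0]]   # 3 x 2 projection
fused = fuse(rgb_tokens, points, weight)
```

Because the fusion is a sum, the token count and embedding width are unchanged, so the geometry signal can be introduced without altering the backbone's input interface.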
