QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang
TL;DR
This work introduces QUAR-VLA, a Vision-Language-Action framework for quadruped robots that integrates visual perception and language instructions to produce executable actions, addressing the limitations of perception-only or language-only approaches. It contributes QUARD, a large-scale quadruped dataset with vision, language, and robot commands, and QUART, a decoder-only transformer that maps first-person imagery and textual instructions to discrete action tokens, which are detokenized into real commands. A sim-to-real co-training pipeline bridges the sim-to-real gap. Extensive experiments show that QUART achieves superior multi-task performance and strong generalization to unseen objects and instructions, with robust sim-to-real transfer when simulated data is used. Overall, the work demonstrates emergent capabilities in quadruped control, such as complex locomotion and whole-body manipulation guided by language, advancing autonomous and versatile legged robots.
Abstract
An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes seamless autonomous reasoning, decision-making, and action execution difficult to achieve. To address these limitations, this paper introduces a novel paradigm, Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information: the robot must accurately interpret detailed instructions and act on them in harmony with its visual observations. Consequently, we propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots. We also present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks, for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.
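
The TL;DR describes QUART as emitting discrete action tokens that are detokenized into real commands. The summary does not say how that mapping is defined, so the following is only a minimal sketch of one common scheme, uniform binning of a continuous command dimension. The class name `ActionTokenizer`, the bin count of 256, and the example command range are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionTokenizer:
    """Uniform binning of one continuous command dimension (hypothetical)."""

    low: float          # smallest command value that can be represented
    high: float         # largest command value that can be represented
    n_bins: int = 256   # assumed vocabulary size per dimension

    def tokenize(self, value: float) -> int:
        """Map a continuous command to a bin index in [0, n_bins - 1]."""
        clipped = min(max(value, self.low), self.high)
        frac = (clipped - self.low) / (self.high - self.low)
        # The upper boundary would land in bin n_bins, so clamp it.
        return min(int(frac * self.n_bins), self.n_bins - 1)

    def detokenize(self, token: int) -> float:
        """Map a bin index back to the center of its bin."""
        if not 0 <= token < self.n_bins:
            raise ValueError(f"token {token} outside [0, {self.n_bins})")
        width = (self.high - self.low) / self.n_bins
        return self.low + (token + 0.5) * width


# Hypothetical command dimension with an assumed range of [-1.0, 1.0].
tokenizer = ActionTokenizer(low=-1.0, high=1.0)
token = tokenizer.tokenize(0.3)
recovered = tokenizer.detokenize(token)
```

Under this scheme the round-trip error is bounded by half a bin width, so the bin count trades vocabulary size against command resolution.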
