QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

Pengxiang Ding; Han Zhao; Wenjie Zhang; Wenxuan Song; Min Zhang; Siteng Huang; Ningxi Yang; Donglin Wang

TL;DR

This work introduces QUAR-VLA, a Vision-Language-Action framework for quadruped robots that integrates visual perception and language instructions to produce executable actions, addressing the limitations of vision-only and language-only approaches. It contributes QUARD, a large-scale quadruped dataset with vision, language, and robot commands, and QUART, a decoder-only transformer that maps first-person imagery and textual instructions to discrete action tokens, which are detokenized into real robot commands. Co-training on simulated and real-world data is used to narrow the sim-to-real gap. Extensive experiments show that QUART achieves superior multi-task performance and generalizes well to unseen objects and instructions, with robust sim-to-real transfer when simulated data is included in training. Overall, the work demonstrates emergent capabilities in quadruped control, such as complex locomotion and whole-body manipulation guided by language, advancing autonomous and versatile legged robots.

Abstract

An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes it difficult to achieve seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a novel paradigm named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making, with the central aim of elevating the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information, so that the robot accurately interprets and acts upon detailed instructions in harmony with its visual observations. Consequently, we propose QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots. We also present QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.

Paper Structure (18 sections, 3 equations, 12 figures, 8 tables)

Introduction
Related Work
Method
Problem Setup
Large-scale Quadruped Robot Datasets
Vision-Language-Action Model
Experiments
Implementation Details
Overall Performance
Conclusion & Future Work
Abstract
Details of data collection
More experiments
Detailed results on seen tasks
Detailed results on unseen tasks
...and 3 more sections

Figures (12)

Figure 1: Comparison of QUAR-VA, QUAR-LA, and QUAR-VLA. QUAR-VA uses only coarse-grained vision information and lacks explicit instructions for handling diverse tasks. In contrast, QUAR-LA relies exclusively on language information and lacks the vision information needed for autonomy. QUAR-VLA therefore combines both vision information and language instructions as inputs, enabling autonomous problem-solving across a range of tasks. The comparison highlights the distinct input modalities and task capabilities of each setting.
Figure 1: Mission: "go to the left corner of the object". The left picture is produced by the CLIP model, the middle picture by the VC-1 model, and the right picture by QUART.
Figure 2: Overview of QUAR-VLA. Our tasks encompass a diverse range of perception, navigation, and other advanced capabilities. The Vision-Language-Action (VLA) model is first trained on a large amount of simulation data (259K episodes) and a small amount of real-world data (3K episodes). In the inference phase, images and texts are tokenized, after which QUART generates 12-dimensional action tokens. These tokens are subsequently detokenized into valid robot actions and deployed on a physical quadruped robot. This methodology effectively extends the capabilities learned in simulation to real-world applications.
Figure 2: Mission: "go to the back of the object". The left picture is produced by the CLIP model, the middle picture by the VC-1 model, and the right picture by QUART.
Figure 3: The left figure shows the trajectory lengths for different tasks, and the right figure shows the relationships between tasks. As the difficulty of the skill increases, the average trajectory length gradually increases. The right figure shows how the task types build on one another: perception is the foundational capability; basic navigation is built on perception; object manipulation, obstacle avoidance, spatial navigation, and environment adaptation extend from perception and navigation.
...and 7 more figures
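
The Figure 2 caption above describes QUART emitting 12-dimensional action tokens that are detokenized into valid robot actions. The following is a minimal sketch of one common way to do this, uniform binning per action dimension. It is an illustration only, not the authors' implementation: the bin count (256), the per-dimension ranges, and the function names `tokenize_action` and `detokenize_action` are all assumptions that do not come from this summary.

```python
from typing import List, Sequence

ACTION_DIM = 12   # from the Figure 2 caption
NUM_BINS = 256    # assumption: the bin count is not stated in this summary


def tokenize_action(action: Sequence[float],
                    low: Sequence[float],
                    high: Sequence[float],
                    num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)        # keep the value inside its range
        scaled = (clipped - lo) / (hi - lo)      # normalize to [0, 1]
        # The upper bound maps to num_bins, so clamp it into the last bin.
        tokens.append(min(int(scaled * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int],
                      low: Sequence[float],
                      high: Sequence[float],
                      num_bins: int = NUM_BINS) -> List[float]:
    """Map each bin index back to the center of its bin."""
    return [lo + (t + 0.5) * (hi - lo) / num_bins
            for t, lo, hi in zip(tokens, low, high)]


if __name__ == "__main__":
    # Hypothetical normalized ranges; real command limits would differ.
    low = [-1.0] * ACTION_DIM
    high = [1.0] * ACTION_DIM
    command = [0.0] * ACTION_DIM
    tokens = tokenize_action(command, low, high)
    print(tokens)
    print(detokenize_action(tokens, low, high))
```

With uniform bins, the round-trip error in each dimension is at most half a bin width, so a finer discretization trades a larger token vocabulary for more precise commands.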
