QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Xinyang Tong; Pengxiang Ding; Yiguo Fan; Donglin Wang; Wenjie Zhang; Can Cui; Mingyang Sun; Han Zhao; Hongyin Zhang; Yonghao Dang; Siteng Huang; Shangke Lyu

TL;DR

QUART-Online tackles the latency bottleneck of multimodal language-guided quadruped control by introducing Action Chunk Discretization (ACD), a temporal vector-quantization scheme that compresses action sequences into discrete tokens. The method aligns vision, language, and compressed actions in a unified semantic space and uses an action decoder to reconstruct continuous trajectories, enabling latency-free inference at 50 Hz. Experimental results on QUARD show substantial gains in real-time performance and in generalization to unseen visuals and instructions, surpassing the original QUART and other baselines while preserving the MLLM's core capabilities. Overall, the work demonstrates that action-space discretization and semantic alignment can enable real-time, scalable multimodal control for quadruped robots without degrading the performance of the foundation model.

Abstract

This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieves real-time inference in sync with the underlying controller frequency, and boosts the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
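
To make the idea of mapping continuous action values onto a small set of discrete representative vectors concrete, the following is a minimal sketch of nearest-neighbour vector quantization over action chunks. It is an illustration under stated assumptions, not the authors' implementation: the codebook here is a hand-written toy table (in the paper it would be learned), decoding is a plain table lookup rather than a learned action decoder, and the names `chunk_actions`, `quantize`, `decode`, `CODEBOOK`, `ACTION_DIM`, and `CHUNK_LEN` are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]

ACTION_DIM = 2   # assumed toy action dimension (hypothetical)
CHUNK_LEN = 2    # assumed number of consecutive actions per chunk (hypothetical)

# Toy codebook: each entry is one flattened chunk of CHUNK_LEN * ACTION_DIM values.
# A real system would learn these representative vectors from data.
CODEBOOK: List[Vector] = [
    [0.0, 0.0, 0.0, 0.0],   # token 0: stand still
    [1.0, 0.0, 1.0, 0.0],   # token 1: move along the first axis
    [0.0, 1.0, 0.0, 1.0],   # token 2: move along the second axis
]


def chunk_actions(actions: Sequence[Sequence[float]], chunk_len: int) -> List[Vector]:
    """Flatten each run of `chunk_len` consecutive actions into one vector."""
    if len(actions) % chunk_len != 0:
        raise ValueError("trajectory length must be a multiple of chunk_len")
    chunks: List[Vector] = []
    for start in range(0, len(actions), chunk_len):
        window = actions[start:start + chunk_len]
        chunks.append([value for step in window for value in step])
    return chunks


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(chunk: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index (discrete token) of the closest codebook vector."""
    return min(range(len(codebook)), key=lambda i: squared_distance(chunk, codebook[i]))


def decode(token: int, codebook: Sequence[Sequence[float]], action_dim: int) -> List[Vector]:
    """Look up a token and split its vector back into per-step actions."""
    flat = list(codebook[token])
    return [flat[i:i + action_dim] for i in range(0, len(flat), action_dim)]


# A four-step noisy trajectory becomes two discrete tokens.
trajectory = [[0.9, 0.1], [1.1, -0.1], [0.1, 0.8], [-0.1, 1.2]]
tokens = [quantize(c, CODEBOOK) for c in chunk_actions(trajectory, CHUNK_LEN)]
reconstructed = [step for t in tokens for step in decode(t, CODEBOOK, ACTION_DIM)]
print(tokens)         # [1, 2]
print(reconstructed)  # [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

In this toy setup, one token stands for `CHUNK_LEN` control steps, so the model needs to emit fewer tokens per second of motion. The sketch shows only that compression step and does not model how the MLLM is fine-tuned on the resulting tokens.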
