QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning

Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu

TL;DR

QUART-Online tackles the latency bottleneck of multimodal language-guided quadruped control by introducing Action Chunk Discretization (ACD), a temporal vector-quantization scheme that compresses action sequences into discrete tokens. The method aligns vision, language, and compressed actions in a unified semantic space and uses an action decoder to reconstruct continuous trajectories, enabling latency-free inference at 50Hz. Experimental results on QUARD show substantial gains in real-time performance and generalization to unseen visuals and instructions, surpassing the original QUART and other baselines while preserving the MLLM's core capabilities. Overall, the work demonstrates that action-space discretization and semantic alignment can unlock real-time, scalable multimodal control for quadruped robots without degrading foundational model performance.

Abstract

This paper addresses the inherent inference latency challenges of deploying multimodal large language models (MLLMs) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. We then fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online works in tandem with the existing MLLM system and achieves real-time inference in sync with the underlying controller frequency, boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
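The core idea of mapping continuous actions onto a small set of representative vectors can be illustrated with a nearest-neighbour codebook lookup. The sketch below is not the paper's implementation: the chunk length, action dimension, codebook size, random codebook, and the helper names `quantize_chunk` and `decode_chunk` are all hypothetical, and a real system would presumably learn the codebook and decoder from data rather than sample them.

```python
import numpy as np

def quantize_chunk(chunk, codebook):
    """Return the index of the codebook vector nearest to a flattened action chunk."""
    flat = chunk.reshape(-1)                         # (chunk_len * action_dim,)
    dists = np.sum((codebook - flat) ** 2, axis=1)   # squared L2 distance to every code
    return int(np.argmin(dists))

def decode_chunk(index, codebook, chunk_len, action_dim):
    """Reconstruct a continuous action chunk from a discrete token index."""
    return codebook[index].reshape(chunk_len, action_dim)

# Hypothetical sizes, chosen only for illustration.
chunk_len, action_dim, codebook_size = 10, 12, 256

rng = np.random.default_rng(0)
# Stand-in for a learned codebook: one row per discrete action token.
codebook = rng.normal(size=(codebook_size, chunk_len * action_dim))

# A chunk that lies close to code 42, perturbed by small noise.
chunk = codebook[42].reshape(chunk_len, action_dim)
chunk = chunk + 0.01 * rng.normal(size=(chunk_len, action_dim))

token = quantize_chunk(chunk, codebook)                        # discrete token for the MLLM
recon = decode_chunk(token, codebook, chunk_len, action_dim)   # continuous actions for the controller
```

In this toy setup a single token stands in for a whole chunk of continuous actions, which is why the language model needs to emit far fewer tokens per control step than it would with per-dimension action tokens.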

Paper Structure

This paper contains 14 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of QUART-Online. With action chunk discretization, QUART-Online lets the existing MLLM system, which previously operated at a low frequency, execute more precise actions in real time at 50 Hz.
  • Figure 2: Comparison of QUART and QUART-Online. QUART-Online speeds up inference with two strategies: 1) it accelerates MLLM inference by generating fewer tokens in the latent space than in the raw action space (2.5x); 2) it introduces an action chunk mechanism in the action decoding phase, enabling higher-frequency inference via multi-step predictions (10x). Together, these raise the inference rate of the original large quadruped robot model, QUART, from 2 Hz to 50 Hz, improving the model's accuracy in rapidly changing scenarios.
  • Figure 3: Overall framework of QUART-Online.
  • Figure 4: Comparison of QUART (top) and QUART-Online (bottom). QUART's inference latency causes a collision with the red bar (red highlight), while QUART-Online responds quickly enough to avoid the obstacle (green highlight).
  • Figure 5: Experiments in the real world.
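
The rates quoted in the Figure 2 caption are consistent with the two speedups multiplying: 2 Hz x 2.5 x 10 = 50 Hz. The short calculation below works through that arithmetic. It assumes the factors compose multiplicatively and that the 10x factor corresponds to ten actions decoded per MLLM call; both are readings of the caption rather than statements from the paper, and the function names are hypothetical.

```python
def mllm_call_rate(base_hz, token_speedup):
    """Rate of MLLM forward passes after the token-count reduction."""
    return base_hz * token_speedup

def effective_control_rate(base_hz, token_speedup, actions_per_call):
    """Actions delivered per second when each MLLM call yields a chunk of actions."""
    return mllm_call_rate(base_hz, token_speedup) * actions_per_call

base_hz = 2.0            # original QUART inference rate (from the caption)
token_speedup = 2.5      # fewer tokens in the latent space (from the caption)
actions_per_call = 10    # assumed reading of the 10x action-chunk factor

call_hz = mllm_call_rate(base_hz, token_speedup)                               # 5.0 Hz
control_hz = effective_control_rate(base_hz, token_speedup, actions_per_call)  # 50.0 Hz

# Time budgets implied by those rates, in milliseconds.
mllm_budget_ms = 1000.0 / call_hz         # 200.0 ms per MLLM call
control_period_ms = 1000.0 / control_hz   # 20.0 ms per action
```

Under these assumptions the MLLM only has to finish one call every 200 ms, while the controller still receives a fresh action every 20 ms.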