HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen

TL;DR

This work tackles the bottleneck of slow inference in Vision-Language-Action robotics by introducing HiRT, a hierarchical transformer framework that decouples slow, VLM-driven latent understanding from a fast latent-conditioned policy. By caching VLM embeddings and running a lightweight policy at high frequency, HiRT achieves near-VLM-level generalization with significantly improved control speed (up to 9.8 Hz) and robust performance in dynamic manipulation tasks. Empirical results across simulated benchmarks and real robots show HiRT substantially boosts success in dynamic tasks (from 48% to 75%) while maintaining competitive static-task performance. The approach offers a practical path to deploying powerful VLM-based control in real-time robotic systems, with potential for broader multi-task and real-world applicability.

Abstract

Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, this success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks that require rapid interaction. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between control frequency and performance. HiRT keeps VLMs running at low frequencies to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency while achieving comparable success rates. Additionally, on novel real-world dynamic manipulation tasks, which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
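
The frequency split described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `LatentBuffer`, `run_hierarchical_loop`, `toy_slow_model`, `toy_fast_policy`, and the `slow_every` parameter are all hypothetical, the "models" are plain functions on floats, and the slow model is called synchronously every few steps rather than asynchronously as in HiRT. It only shows the idea of a fast policy acting at every step while conditioning on a cached, slowly refreshed latent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LatentBuffer:
    """Hypothetical cache holding the most recent latent from the slow model."""
    latent: Optional[List[float]] = None
    updated_at: int = -1

    def write(self, latent: List[float], step: int) -> None:
        self.latent = latent
        self.updated_at = step


def run_hierarchical_loop(
    slow_model: Callable[[str, float], List[float]],
    fast_policy: Callable[[float, List[float]], float],
    observations: List[float],
    instruction: str,
    slow_every: int,
) -> Tuple[List[float], int]:
    """Run the fast policy at every step; refresh the latent every `slow_every` steps."""
    buffer = LatentBuffer()
    actions: List[float] = []
    slow_calls = 0
    for step, obs in enumerate(observations):
        if step % slow_every == 0:
            # Low-frequency path: the expensive model refreshes the cached latent.
            buffer.write(slow_model(instruction, obs), step)
            slow_calls += 1
        # High-frequency path: the cheap policy reuses the latest cached latent.
        actions.append(fast_policy(obs, buffer.latent))
    return actions, slow_calls


# Toy stand-ins (assumptions for illustration only, not real VLM or policy models).
def toy_slow_model(instruction: str, obs: float) -> List[float]:
    # "Latent" = instruction length plus the observation the slow model saw.
    return [float(len(instruction)), obs]


def toy_fast_policy(obs: float, latent: List[float]) -> float:
    # "Action" = how far the current observation has drifted since the latent was computed.
    return obs - latent[1]


observations = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
actions, slow_calls = run_hierarchical_loop(
    toy_slow_model, toy_fast_policy, observations, "pick up the cup", slow_every=3
)
print(actions, slow_calls)
```

With `slow_every=3`, the slow model runs on only two of the six steps, while the fast policy still produces an action at every step. Raising `slow_every` trades latent freshness for fewer expensive calls, which is the kind of frequency and performance trade-off the abstract refers to.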

Paper Structure

This paper contains 21 sections, 4 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Illustration of our proposed HiRT high-level architecture. (a) Large VLA models directly output low-level actions with the VLM, whereas (b) HiRT is a hierarchical policy built on a VLM. Given a task language instruction, the VLM encodes the observations into features that integrate multimodal information, and a lightweight action policy then conditions on this latent to generate low-level actions asynchronously. As shown in (c), our method achieves higher performance and significantly faster inference.
  • Figure 2: HiRT network architecture. A vision-language model transforms the instruction, together with a sampled visual observation, into a continuous latent, which is cached in a latent buffer. At each inference step, the pre-trained vision encoder encodes the visual observations conditioned on the latest latent, and the reduced vision-language tokens are then decoded into low-level actions by a conditioned action head.
  • Figure 3: Visualization of the tasks in three domains. Left: Metaworld (Yu et al., 2020), where we focus on the ability to learn multiple tasks. Middle: Franka-Kitchen (Gupta et al., 2019), where we study the ability to generalize to new scenes. Right: our real-world settings, in which the model is trained on simple quasi-static tasks and tested on much more complex scenarios with unseen objects.
  • Figure 4: Speed and performance of HiRT at different VLM frequencies. Each chart compares HiRT in its different states with Vanilla-VLA and RT-1.
  • Figure 5: Visualized Dynamic Tasks.
  • ...and 2 more figures