TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models

Zongzheng Zhang, Haobo Xu, Zhuo Yang, Chenghao Yue, Zehao Lin, Huan-ang Gao, Ziwei Wang, Hao Zhao

TL;DR

This work introduces Torque-aware Vision-Language-Action (VLA) models to bridge the gap between perceptual understanding and physical interaction in manipulation tasks. By systematically exploring where and how to inject torque information, the authors demonstrate that decoder-side torque adapters, a single-token torque history, and a joint action–torque diffusion objective yield robust improvements on contact-rich and regular tasks. The approach shows strong transfer across different VLA backbones and robotic embodiments, indicating broad applicability. The findings provide practical design principles for enriching pretrained VLA models with proprioceptive cues to achieve more reliable and generalizable manipulation policies.

Abstract

Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. Second, summarizing the torque history into a single token also improves performance. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.
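To make two of these ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a torque history can be flattened and linearly projected into one token, and it replaces the paper's joint action-torque diffusion objective with a plain mean-squared-error loss for brevity. The names `HISTORY_LEN`, `TOKEN_DIM`, `make_projection`, `torque_history_to_token`, `joint_loss`, and `torque_weight` are hypothetical and do not come from the paper.

```python
import random

NUM_JOINTS = 7     # 7-DoF arm, as in Figure 1
HISTORY_LEN = 10   # hypothetical number of past torque readings
TOKEN_DIM = 16     # hypothetical token width


def make_projection(in_dim, out_dim, seed=0):
    """Random linear map standing in for a learned projection (assumption)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def torque_history_to_token(history, weights):
    """Flatten a HISTORY_LEN x NUM_JOINTS torque history and project it to one token."""
    flat = [tau for step in history for tau in step]
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(action_pred, action_target, torque_pred, torque_target, torque_weight=0.5):
    """Action loss plus a weighted auxiliary torque-prediction loss (MSE stand-in)."""
    return mse(action_pred, action_target) + torque_weight * mse(torque_pred, torque_target)


# Synthetic torque history: one row of joint torques per past time step.
history = [[0.1 * t] * NUM_JOINTS for t in range(HISTORY_LEN)]
weights = make_projection(HISTORY_LEN * NUM_JOINTS, TOKEN_DIM)
token = torque_history_to_token(history, weights)
print(len(token))  # the history collapses to a single TOKEN_DIM-wide vector
```

The point of the sketch is structural: the whole history occupies one token slot in the input sequence, and torque prediction enters training only as an extra weighted term alongside the action loss.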

Paper Structure

This paper contains 36 sections, 12 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: (a) Torque response of the 7-DoF arm during a charger-insertion task. Shaded gray regions mark periods of no contact, where torques remain nearly flat. The orange-tinted segment shows a failed insertion attempt—contact is made but the plug does not enter the socket, producing only small torque fluctuations. The green-tinted segment highlights a successful insertion, characterized by large, distinctive torque spikes as the plug seats fully. (b) Visualization of the 7-DoF robot arm, highlighting joint torque mappings. (c) Design space of torque-based features explored in this work, spanning current, historical, and future signals.
  • Figure 2: Architectures for embedding torque signals.
  • Figure 3: Normalized HSIC values across hidden states from different modality input tokens.
  • Figure 4: Architectures for embedding torque history.
  • Figure 5: Architectures for Action-Torque Diffusion.
  • ...and 9 more figures