
Learning When to See and When to Feel: Adaptive Vision-Torque Fusion for Contact-Aware Manipulation

Jiuzhou Lei, Chang Liu, Yu She, Xiao Liang, Minghui Zheng

Abstract

Vision-based policies have achieved strong performance in robotic manipulation due to the accessibility and richness of visual observations. However, purely visual sensing becomes insufficient in contact-rich and force-sensitive tasks, where force/torque (F/T) signals provide critical information about contact dynamics, alignment, and interaction quality. Although various strategies have been proposed to integrate vision and F/T signals, including auxiliary prediction objectives, mixture-of-experts architectures, and contact-aware gating mechanisms, a systematic comparison of these approaches is still lacking. In this work, we present a comparative study of F/T-vision integration strategies within diffusion-based manipulation policies. In addition, we propose an adaptive integration strategy that ignores F/T signals during non-contact phases while adaptively leveraging both vision and torque information during contact. Experimental results demonstrate that our method outperforms the strongest baseline by 14% in success rate, highlighting the importance of contact-aware multimodal fusion for robotic manipulation.
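
Concretely, the overview in Figure B1 below describes the final denoising noise $\hat{\epsilon}$ as a blend of two modality-specific noise predictions governed by a predicted scalar weight. One natural reading (our notation, assuming a convex combination) is

$$\hat{\epsilon} = (1 - w)\,\epsilon_{\text{vision}} + w\,\epsilon_{\text{torque}}, \qquad w \in [0, 1],$$

where $w$ stays near zero during free-space motion and rises during contact, consistent with the weight behavior shown in Figure A1.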

Figures (7)

  • Figure A1: Contact-Aware Manipulation Challenges. The top row shows that relying solely on visual input leaves the policy contact-unaware, causing failures in contact-rich tasks. The middle row shows that naively fusing F/T signals with visual features can degrade policy accuracy during free-space motion. The bottom-left panel shows that the predicted force guidance weight varies interpretably over inference steps, rising during contact phases and falling during free-space motion. We evaluate across three contact-rich tasks (right), and our method achieves an 82% average success rate, outperforming all baselines by a substantial margin (bottom).
  • Figure B1: Overview of the Proposed Method. RGB images are encoded using ResNet, with each camera view processed by a separate encoder. Torque signals are encoded using an MLP (the robot state encoder is omitted from the figure for clarity). The resulting torque features are then passed through a contact-gated module, which modulates them based on contact status to produce contact-gated torque features. A scale predictor takes the concatenated vision and torque features as input and outputs a scalar weight that determines the relative influence of torque information when combining the noise predictions from the two modality-specific noise predictors in the diffusion process. The final denoising noise $\hat{\epsilon}$ is obtained by blending these two predictions according to the predicted scale (a minimal code sketch of this fusion follows the figure list).
  • Figure D1: Visualization of the Three Tasks. First row: egg boiler lid opening; second row: weight-based bottle placement; third row: twisty connector pull-out.
  • Figure D2: Experiment Setup
  • Figure D3: External Joint Torque vs. Time During a Task Execution. Torque measurements at different phases of a task: during the approach and lifting stages, the torque measurements are patternless and fluctuating. Different colors represent the torque measurements at different joints.
  • ...and 2 more figures
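
The fusion pipeline described in Figure B1 is concrete enough to sketch. Below is a minimal, hypothetical PyTorch sketch, assuming a 7-joint torque vector and illustrative feature dimensions; the module names (torque_encoder, contact_gate, scale_predictor) and the convex blending of the two noise predictions are our reading of the caption, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AdaptiveVisionTorqueFusion(nn.Module):
    """Hypothetical sketch of the Figure B1 fusion; dimensions are illustrative."""

    def __init__(self, vis_dim: int = 512, torque_dim: int = 64, n_joints: int = 7):
        super().__init__()
        # MLP torque encoder (caption: torque signals are encoded with an MLP)
        self.torque_encoder = nn.Sequential(
            nn.Linear(n_joints, 128), nn.ReLU(), nn.Linear(128, torque_dim))
        # Contact gate: suppresses torque features when no contact is sensed
        self.contact_gate = nn.Sequential(nn.Linear(torque_dim, 1), nn.Sigmoid())
        # Scale predictor: concatenated vision+torque features -> scalar weight w
        self.scale_predictor = nn.Sequential(
            nn.Linear(vis_dim + torque_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, vis_feat, joint_torques, eps_vision, eps_torque):
        # Encode torques and gate them by a learned, soft contact status
        torque_feat = self.torque_encoder(joint_torques)
        gated = self.contact_gate(torque_feat) * torque_feat
        # Predict the blending weight w from the concatenated features
        w = self.scale_predictor(torch.cat([vis_feat, gated], dim=-1))
        # Blend the two modality-specific noise predictions (our assumed form)
        return (1.0 - w) * eps_vision + w * eps_torque
```

Keeping $w$ in $[0, 1]$ via a sigmoid makes the blend convex, so the policy degrades gracefully to the vision-only prediction whenever the gate drives the torque features toward zero in free space.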