Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu

TL;DR

This work addresses the need for truly simultaneous tactile-visual perception in robotic manipulation by introducing TacThru, a see-through sensor built on a transparent elastomer with persistent illumination and robust keyline markers. TacThru is integrated into TacThru-UMI, a Diffusion Policy-based imitation-learning framework that fuses visual, tactile, and proprioceptive signals through a Transformer backbone. Across five real-world tasks, the approach achieves an average success rate of 85.5%, outperforming the alternating tactile-visual (66.3%) and vision-only (55.4%) baselines, with notable gains on thin and soft objects and on tasks requiring multimodal coordination. The results indicate that simultaneous multimodal perception, combined with modern learning frameworks, enables more precise, adaptable, and robust robotic manipulation.

Abstract

Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.

Paper Structure

This paper contains 17 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: Fabrication of the TacThru sensor and integration into the TacThru-UMI system. (a) The keyline marker elastomer is fabricated by sequentially spraying inner (black) and outer (white) markers on a transparent elastomer using laser-cut masks. (b) The TacThru sensor features an extended linkage that serves as gripper fingers. (c) The TacThru-UMI platform includes a robot end-effector (left) and a data collector (right) that share identical body and finger designs, with the fingers actuated by an Inspire LAS30-021D servo electric cylinder with a maximum opening width of 72 mm.
  • Figure 2: Keyline marker design and filtering enable robust tracking. (a) The evaluation setup compares two sensor types (keyline vs. solid markers) during bottle grasping tasks. (b) The TacThru sensor view comparison shows that keyline markers (left) remain distinct against complex backgrounds, while solid markers (right) become invisible. (c) Quantitative results show that our filtered keyline method tracks all 64 markers stably and efficiently (6.08 ms processing time per frame), whereas solid markers suffer missed detections and unfiltered keyline detection produces false positives (count > 64).
  • Figure 3: Diffusion policy architecture for TacThru-UMI. Multimodal observations (wrist-camera RGB images, sensor RGB images, detected marker deviations, and proprioception) are encoded into tokens and augmented with positional and modality-specific embeddings. These tokens condition a Transformer-based diffusion policy that denoises Gaussian noise into action chunks for robot execution. The example shows how the policy leverages the TacThru's close-up view to align the cap and mount during the InsertCap task.
  • Figure 4: Task demonstrations across five manipulation scenarios. (a) PickBottle: basic pick-and-place, (b) PullTissue: thin-and-soft object manipulation, (c) SortBolt: visual discrimination, (d) HangScissors: tactile discrimination, (e) InsertCap: multimodal fusion. Top: Initial object configurations. Middle: Wrist-camera view progression during demonstration. Bottom: Corresponding TacThru (top) and GelSight sensor (bottom) observations, illustrating distinct sensing modalities and information content.
  • Figure 5: Quantitative results across manipulation tasks and sensing modalities. Success rates for four policy variants: TT-M (TacThru with markers), TT (TacThru only), GS-M (GelSight with markers), and Wrist (vision-only). Each task evaluates specific sensing capabilities: basic manipulation (PickBottle), thin-and-soft object manipulation (PullTissue), visual discrimination (SortBolt), tactile discrimination (HangScissors), and multimodal fusion (InsertCap). Error bars show standard deviation across evaluation runs. The rightmost column presents overall performance averages.
  • ...and 2 more figures
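
The token assembly described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes each modality has already been encoded into fixed-length token vectors and that the positional and modality-specific embeddings are simply added to them. The embedding size, the table sizes, the one-token-per-modality layout, and the names `make_table` and `assemble_tokens` are all hypothetical; the paper's encoders and the diffusion denoiser itself are not modeled here.

```python
import random

# Hypothetical size; the paper's actual dimensions are not given in this section.
EMBED_DIM = 8
# The four observation streams named in the Figure 3 caption.
MODALITIES = ["wrist_rgb", "sensor_rgb", "marker_deviation", "proprioception"]


def make_table(rows, dim, rng):
    """Stand-in for a learned embedding table: `rows` vectors of length `dim`."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


def assemble_tokens(features, modality_table, position_table):
    """Add modality and positional embeddings to each encoded token.

    `features` maps a modality name to a list of token vectors that a
    per-modality encoder (not modeled here) is assumed to have produced.
    """
    tokens = []
    for m_idx, name in enumerate(MODALITIES):
        for vec in features[name]:
            pos = len(tokens)  # index in the flattened conditioning sequence
            tokens.append([
                v + modality_table[m_idx][d] + position_table[pos][d]
                for d, v in enumerate(vec)
            ])
    return tokens


rng = random.Random(0)
# Assumption for illustration: one all-zero token per modality.
features = {name: [[0.0] * EMBED_DIM] for name in MODALITIES}
modality_table = make_table(len(MODALITIES), EMBED_DIM, rng)
position_table = make_table(16, EMBED_DIM, rng)
tokens = assemble_tokens(features, modality_table, position_table)
```

In a full policy, a sequence like `tokens` would serve as the conditioning input to the Transformer that denoises an action chunk; summing the embeddings is one common design choice and is assumed here rather than taken from the paper.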