FlowTouch: View-Invariant Visuo-Tactile Prediction

Seongjin Bien, Carlo Kneissl, Tobias Jülg, Frank Fundel, Thomas Ressler-Antal, Florian Walter, Björn Ommer, Gitta Kutyniok, Wolfram Burgard

TL;DR

This work introduces FlowTouch, a novel model for view-invariant visuo-tactile prediction that uses an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away from scene-dependent details.

Abstract

Tactile sensation is essential for contact-rich manipulation tasks. It provides direct feedback on object geometry, surface properties, and interaction forces, enhancing perception and enabling fine-grained control. An inherent limitation of tactile sensors is that readings are available only when an object is touched. This precludes their use during planning and the initial execution phase of a task. Predicting tactile information from visual information can bridge this gap. A common approach is to learn a direct mapping from camera images to the output of vision-based tactile sensors. However, the resulting model will depend strongly on the specific setup and on how well the camera can capture the area where an object is touched. In this work, we introduce FlowTouch, a novel model for view-invariant visuo-tactile prediction. Our key idea is to use an object's local 3D mesh to encode rich information for predicting tactile patterns while abstracting away from scene-dependent details. FlowTouch integrates scene reconstruction and Flow Matching-based models for image generation. Our results show that FlowTouch is able to bridge the sim-to-real gap and generalize to new sensor instances. We further show that the resulting tactile images can be used for downstream grasp stability prediction. Our code, datasets and videos are available at https://flowtouch.github.io/
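
The abstract names Flow Matching as the generative component but does not give its formulation. As a rough illustration only, the sketch below shows a generic conditional Flow Matching objective and an Euler sampler, assuming a straight-line path between Gaussian noise and the target image. The names `velocity_fn`, `cond`, `flow_matching_loss`, and `sample` are hypothetical and not taken from the FlowTouch code; the authors' actual conditioning, network, and probability path may differ.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over image dims
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity.

    velocity_fn(x, t, cond) is an assumed stand-in for a learned network;
    cond is any conditioning signal (for example, features of local geometry).
    """
    x0 = rng.standard_normal(x1.shape)      # noise sample
    t = rng.uniform(size=x1.shape[0])       # one time per batch element, in [0, 1)
    xt = interpolate(x0, x1, t)
    target = x1 - x0                        # velocity of the straight-line path
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

def sample(velocity_fn, cond, shape, rng, steps=50):
    """Generate samples by Euler-integrating dx/dt = v(x, t, cond) from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In this setup, training would minimize `flow_matching_loss` over pairs of target images and conditioning inputs, and prediction would call `sample` with the conditioning for a new touch point.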

Paper Structure (22 sections, 7 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: FlowTouch at a glance: The robot first looks at the object and creates a mesh with scene mesh generation foundation models. Given the particular touch point on the mesh and the static background image of the tactile sensor, FlowTouch then predicts the resulting tactile image.
  • Figure 2: The FlowTouch architecture: The blue area shows the image-to-PCN sampling pipeline, while the green area shows the generative model's structure. Snowflakes indicate components with frozen weights.
  • Figure 3: Left: A selection of primitive geometries used for generating the simulation data. Right: Taxim GelSight images rendered from a single contact point (corner of a cube), with local translations and rotations.
  • Figure 4: Tactile predictions of the model ablations listed in the training comparison table (tab:training_comparison). Samples are taken from the validation dataset.
  • Figure 5: Left: Objects used for data collection. Center: Robot setup. Right: Collected grasp poses aligned to generated mesh, with unique color for each finger pair.
  • ...and 2 more figures