Tactile-Augmented Radiance Fields

Yiming Dou, Fengyu Yang, Yi Liu, Antonio Loquercio, Andrew Owens

TL;DR

TaRF addresses how to integrate vision and touch within a coherent 3D scene representation. It combines a NeRF-based visual model with a diffusion-based tactile predictor, trained on a large-scale, spatially aligned vision-tactile dataset collected using a camera-mounted touch sensor. The system can render RGB-D views from the TaRF and generate corresponding tactile signals at novel locations, enabling tactile localization and material classification as downstream tasks. The results demonstrate accurate cross-modal synthesis, improved 3D touch localization, and enhanced material understanding, underscoring the potential of tactile-aware scene representations for robotics and immersive virtual environments.

Abstract

We present a scene representation, which we call a tactile-augmented radiance field (TaRF), that brings vision and touch into a shared 3D space. This representation can be used to estimate the visual and tactile signals for a given 3D position within a scene. We capture a scene's TaRF from a collection of photos and sparsely sampled touch probes. Our approach makes use of two insights: (i) common vision-based touch sensors are built on ordinary cameras and thus can be registered to images using methods from multi-view geometry, and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to register touch signals to a captured visual scene, and to train a conditional diffusion model that, provided with an RGB-D image rendered from a neural radiance field, generates its corresponding tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This dataset contains more touch samples than previous real-world datasets, and it provides spatially aligned visual signals for each captured touch signal. We demonstrate the accuracy of our cross-modal generative model and the utility of the captured visual-tactile data on several downstream tasks. Project page: https://dou-yiming.github.io/TaRF

Paper Structure (17 sections, 1 equation, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Tactile-augmented radiance fields. We capture a tactile-augmented radiance field (TaRF) from photos and sparsely sampled touch probes. To do this, we register the captured visual and tactile signals into a shared 3D space, then train a diffusion model to impute touch at other locations within the scene. Here, we visualize two touch probes and their (color coded) 3D positions in the scene. We also show two touch signals estimated by the diffusion model. The touch signals were collected using a vision-based touch sensor [Lambeta et al., 2020] that represents the touch signals as images. Please see our project page (https://dou-yiming.github.io/TaRF) for video results.
  • Figure 2: Visual-tactile examples. In contrast to the visual-tactile data captured in previous work, our approach allows us to sample unobstructed images that are spatially aligned with the touch signal, from arbitrary 3D viewpoints using a NeRF.
  • Figure 3: Capturing setup. (a) We record paired vision and touch signals using a camera attached to a touch sensor. (b) We estimate the relative pose between the touch sensor and the camera using correspondences between sight and touch.
  • Figure 4: Touch estimation. We estimate the tactile signal for a given touch sensor pose $(\mathbf{R}, \mathbf{t})$. To do this, we synthesize a viewpoint from the NeRF, along with a depth map. We use conditional latent diffusion to predict the tactile signal from these inputs.
  • Figure 5: Representative examples from the captured dataset. Our dataset is obtained from nine everyday scenes, such as offices, classrooms, and kitchens. We show three such scenes in the figure above, together with samples of spatially aligned visual and tactile data. In each scene, 1k to 2k tactile probes were collected, resulting in a total of 19.3k image pairs. The data encompasses diverse geometries (edges, surfaces, corners, etc.) and textures (plastic, clothes, snow, wood, etc.) of various materials. The collector systematically probed different objects, covering areas with distinct geometry and texture using different sensor poses.
  • ...and 2 more figures
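
Figures 3 and 4 together imply a simple geometric step: once the camera pose in the scene and the relative pose between camera and touch sensor are known, the touch sensor pose $(\mathbf{R}, \mathbf{t})$ follows by composing two rigid transforms. The sketch below illustrates that composition only. The function names, the 4x4 homogeneous-matrix convention, and all numeric values are assumptions made for illustration and are not taken from the paper or its code.

```python
import numpy as np


def make_pose(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation R and a 3-vector translation t into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def touch_pose_in_world(T_world_cam: np.ndarray, T_cam_touch: np.ndarray) -> np.ndarray:
    """Compose camera-to-world with touch-to-camera to obtain touch-to-world."""
    return T_world_cam @ T_cam_touch


def rot_z(theta: float) -> np.ndarray:
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical camera pose in the scene (e.g., as recovered by multi-view geometry).
T_world_cam = make_pose(rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))

# Hypothetical fixed offset of the touch sensor relative to the camera:
# no rotation, 0.1 units along the camera's z axis.
T_cam_touch = make_pose(np.eye(3), np.array([0.0, 0.0, 0.1]))

# Touch sensor pose (R, t) in scene coordinates.
T_world_touch = touch_pose_in_world(T_world_cam, T_cam_touch)
R, t = T_world_touch[:3, :3], T_world_touch[:3, 3]
```

In a pipeline like the one in Figure 4, a pose of this form would be the viewpoint from which the RGB image and depth map are rendered before they are passed to the tactile predictor. Those rendering and prediction stages are not shown here.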