Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation

Ruihan Gao, Kangle Deng, Gengshan Yang, Wenzhen Yuan, Jun-Yan Zhu

TL;DR

This work introduces Tactile DreamFusion, a tactile-augmented pipeline for high-fidelity 3D generation that fuses visual and tactile texture information. By capturing high-resolution tactile normals with GelSight and modeling them in a differentiable 3D texture field, guided by 2D diffusion priors and Texture DreamBooth, the method achieves coherent, fine geometric details and region-wise textures across text-to-3D and image-to-3D tasks. A diffusion-guided refinement framework with multiple loss terms and a multi-part texture scheme yields textures that align with geometry, outperforming state-of-the-art baselines in both texture realism and geometric detail, as demonstrated by user studies. The approach enables customizable and realistic 3D assets and contributes tactile data collection and texture synthesis techniques to 3D generation, with public TouchTexture data and code forthcoming.

Abstract

3D generation methods have shown visually compelling results powered by diffusion image priors. However, they often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked in albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between two modalities of vision and touch.

Paper Structure

This paper contains 16 sections, 17 equations, 15 figures, 2 tables.

Figures (15)

  • Figure 1: Our method leverages tactile sensing to improve existing 3D generation pipelines. Left: Given a text prompt, we first generate an image using SDXL (Podell et al., 2023) and then run Wonder3D (Long et al., 2023) to generate a mesh from the image. This process often results in a mesh with an overly smooth surface. Right: Our method takes a text prompt and several tactile patches and generates high-fidelity, coherent visual and tactile textures that can be transferred to different meshes. Our method can easily adapt to image-to-3D tasks, as shown in the rightmost column, with the reference image's thumbnail displayed at the bottom right corner. Please visit the project website for video results.
  • Figure 2: Tactile data capture. We collect one patch by pressing GelSight Mini on an object surface. We use Poisson integration to estimate the contact depth from the sensor output, apply high-pass filtering to extract the high-frequency texture information, and then run the 2D texture synthesis method of Image Quilting (Efros and Freeman, 2001) to obtain an initial texture map. Finally, we convert the height map back to a normal map.
  • Figure 3: TouchTexture dataset. We collect tactile normal data from 18 daily objects featuring diverse tactile textures. To demonstrate the local geometric intricacies, we show the tactile normal map and a 3D height map for each object. Please refer to the supplement for the full set of our data.
  • Figure 4: Method overview. Given an input image or a text prompt, our method generates a mesh with high-quality visual and normal texture. We first generate a base mesh with albedo texture using a text- or image-to-3D method. We use a 3D texture field with hash encoding to represent albedo and tactile normal textures and train it with loss functions on rendered images. To capture the scale differences between visual and tactile modalities, we sample distinct camera views, $P_{\text{V}}$ for visual rendering and $P_{\text{T}}$ for tactile rendering. For texture refinement, we train the texture field with a visual matching loss, $\mathcal{L}_{\text{VM}}$, to ensure fidelity to the input mesh, and a visual guidance loss with normal-conditioned ControlNet, $\mathcal{L}_{\text{VG}}$, to enhance photorealism and cross-modal alignment. We further apply a tactile matching loss, $\mathcal{L}_{\text{TM}}$, and a tactile guidance loss, $\mathcal{L}_{\text{TG}}$, using a customized Texture Dreambooth, to achieve high-quality geometric details aligned with the distribution of tactile input V* texture exemplars.
  • Figure 5: 3D generation with a single texture. For each object, we show generated albedo (top), normal (middle), and full color (bottom) renderings from two viewpoints. Our method works for both text-to-3D (corn and football) and image-to-3D (potato and strawberry), generating realistic and coherent visual textures and geometric details. (We use roughness=0.5 when rendering color views in Blender for Figures \ref{fig:teaser}, \ref{fig:results}, \ref{fig:transfer}, and \ref{fig:multiparts}.)
  • ...and 10 more figures
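
The Figure 2 caption describes two steps that are easy to illustrate in isolation: high-pass filtering a height map to keep only fine texture, and converting a height map back to a normal map. The following is a minimal sketch of those two steps, not the authors' implementation. It assumes the contact depth has already been recovered (the Poisson integration and Image Quilting steps are omitted), and the function names `high_pass` and `height_to_normal` as well as the parameters `sigma` and `pixel_size` are hypothetical choices made for illustration.

```python
import numpy as np


def high_pass(height, sigma=8.0):
    """Keep high-frequency texture by subtracting a Gaussian-blurred copy.

    The blur is applied in the frequency domain, so the image is treated
    as periodic. `sigma` (in pixels) is an illustrative value, not one
    taken from the paper.
    """
    h, w = height.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles/pixel
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, cycles/pixel
    # Fourier transform of a unit-area Gaussian with standard deviation sigma.
    lowpass = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    blurred = np.real(np.fft.ifft2(np.fft.fft2(height) * lowpass))
    return height - blurred


def height_to_normal(height, pixel_size=1.0):
    """Convert a height map z(x, y) into unit surface normals.

    The unnormalized normal of the surface z(x, y) is (-dz/dx, -dz/dy, 1).
    `pixel_size` is the assumed spacing between samples, in height units.
    """
    dz_dy, dz_dx = np.gradient(height, pixel_size)  # axis 0 is y, axis 1 is x
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(height)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals


if __name__ == "__main__":
    # Synthetic patch: a smooth bump (coarse shape) plus fine ripples (texture).
    ys, xs = np.mgrid[0:128, 0:128]
    bump = 5.0 * np.exp(-((xs - 64) ** 2 + (ys - 64) ** 2) / (2 * 40.0 ** 2))
    ripples = 0.2 * np.sin(2 * np.pi * xs / 8.0)
    texture = high_pass(bump + ripples, sigma=8.0)
    normals = height_to_normal(texture)
    print(normals.shape)
```

Subtracting a blurred copy is only one common way to build a high-pass filter; the caption does not state which filter the authors use. The normals returned here lie in [-1, 1] per channel, so storing them as an image would require an additional remapping step.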