MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Sirine Bhouri, Lan Wei, Jian-Qing Zheng, Dandan Zhang

TL;DR

MultiDiffSense tackles the data bottleneck in visuo-tactile robotics by introducing a diffusion-based, dual-conditioned model that jointly generates aligned images for ViTac, TacTip, and ViTacTip within a single architecture. It leverages pose-aligned depth maps as geometric conditioning and structured text prompts to specify sensor modality and 4-DoF contact pose, enabling controllable, physically consistent cross-modal synthesis. Empirical results show substantial improvements over a Pix2Pix baseline across seen/unseen objects and poses, and downstream pose estimation with mixed real-synthetic data approaches real-data performance, highlighting practical data-augmentation benefits. The work paves the way for scalable multi-modal tactile datasets, cross-sensor transfer, and flexible deployment in robotics, with future directions including broader object sets, richer geometry, and dynamic contact modeling.
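
As a rough illustration of what a structured prompt encoding sensor modality and a 4-DoF contact pose could look like, here is a minimal Python sketch. The field names, units, the `ContactPose` class, the `build_prompt` helper, and the choice of (x, y, z, yaw) as the four degrees of freedom are all assumptions made for this example; the paper's actual prompt template is the one shown in its Figure 3, which is not reproduced here.

```python
from dataclasses import dataclass

# Sensor modalities named in the paper.
SENSORS = ("ViTac", "TacTip", "ViTacTip")


@dataclass(frozen=True)
class ContactPose:
    """Hypothetical 4-DoF pose: planar translation, depth, and yaw (assumed)."""
    x_mm: float
    y_mm: float
    z_mm: float
    yaw_deg: float


def build_prompt(sensor: str, pose: ContactPose) -> str:
    """Serialise sensor type and pose into one fixed-format text prompt.

    The template below is illustrative only, not the authors' format.
    """
    if sensor not in SENSORS:
        raise ValueError(f"unknown sensor: {sensor!r}")
    return (
        f"sensor: {sensor}; "
        f"x: {pose.x_mm:.2f} mm; y: {pose.y_mm:.2f} mm; "
        f"z: {pose.z_mm:.2f} mm; yaw: {pose.yaw_deg:.1f} deg"
    )


if __name__ == "__main__":
    print(build_prompt("TacTip", ContactPose(1.0, -2.5, 0.5, 30.0)))
```

A fixed field order and fixed numeric precision keep prompts for the same pose identical across sensors apart from the sensor field, which is one plausible way to let a single model switch modality from text alone.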

Abstract

Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
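
The 50% synthetic / 50% real mix used for downstream pose estimation can be sketched as follows. This is a minimal example under stated assumptions: `mix_datasets` and its parameters are hypothetical names, samples are represented as arbitrary list items, and the paper does not specify its sampling procedure, so uniform sampling without replacement is assumed here.

```python
import random


def mix_datasets(real, synthetic, total, synthetic_fraction=0.5, seed=0):
    """Build a training set of `total` samples with a fixed synthetic share.

    `real` and `synthetic` are lists of samples. Sampling is uniform and
    without replacement (an assumption, not the authors' stated procedure).
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")
    n_synthetic = round(total * synthetic_fraction)
    n_real = total - n_synthetic
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)  # seeded for reproducible splits
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real = [("real", i) for i in range(100)]
    synthetic = [("synthetic", i) for i in range(100)]
    train = mix_datasets(real, synthetic, total=100)
    print(sum(1 for source, _ in train if source == "real"), "real samples")
```

With `synthetic_fraction=0.5`, a training set of a given size needs only half as many real samples as an all-real set, which is the sense in which the mix halves the required real data.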

Paper Structure (25 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Framework Overview. The model takes a CAD file and textual prompt as inputs. The CAD model is converted into a pose-aligned depth map (control image) fed via zero-convolutions into the ControlNet branch as the geometric condition. The text prompt is encoded with CLIP and injected into the UNet via cross-attention. The decoder then refines the latents based on both conditions to generate an image reflecting the desired object geometry, contact pose, and sensor modality.
  • Figure 2: Control Image Processing Pipeline. The pipeline takes an STL file, a target image, and a CSV log of end-effector poses (pose annotations) as inputs and consists of four stages: (1) render a depth map from the STL file and preprocess it to extract clean object masks; (2) align robot coordinates to image pixels via centroid mapping; (3) scale XY translations using workspace calibration, incorporate Z-axis depth through geometric scaling and intensity modulation, and apply yaw rotation using 2D rotation matrices; (4) minimise the centre alignment error to $< 5$ pixels ($\approx 0.6$ mm).
  • Figure 3: Example of Structured Textual Prompt
  • Figure 4: Visualisation of image generation results on unseen objects across three tactile sensor modalities (ViTacTip, ViTac, TacTip). Red dashed boxes highlight regions where the methods differ: MultiDiffSense better preserves contact geometry, marker patterns, and lighting.
  • Figure 5: Effect of prompt length on reconstruction quality. Real vs. generated images from the two model variants under the two testing scenarios (seen objects with unseen poses, and unseen objects).
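
The XY scaling and yaw rotation steps described in the Figure 2 caption can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names, the image axis conventions, and the calibration constant, which is merely inferred from the caption's "< 5 pixels ≈ 0.6 mm" (about 0.12 mm per pixel). The Z-axis handling (geometric scaling and intensity modulation) is not covered.

```python
import math

# Approximate scale implied by "< 5 pixels ~ 0.6 mm" in the Figure 2 caption.
# Assumed value; the real workspace calibration may differ.
MM_PER_PIXEL = 0.6 / 5.0


def pose_to_pixel(x_mm, y_mm, centroid_px, mm_per_pixel=MM_PER_PIXEL):
    """Map a robot XY offset (mm) to image coordinates (pixels).

    The object centroid in the image is treated as the origin of the
    robot frame (centroid mapping). Axis directions are assumed to match.
    """
    cx, cy = centroid_px
    return (cx + x_mm / mm_per_pixel, cy + y_mm / mm_per_pixel)


def rotate_about(point, centre, yaw_deg):
    """Rotate `point` about `centre` by `yaw_deg` using a 2D rotation matrix."""
    theta = math.radians(yaw_deg)
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (
        centre[0] + math.cos(theta) * dx - math.sin(theta) * dy,
        centre[1] + math.sin(theta) * dx + math.cos(theta) * dy,
    )


if __name__ == "__main__":
    centre = (128.0, 128.0)  # hypothetical centroid of a 256x256 image
    print(pose_to_pixel(1.2, -0.6, centre))
    print(rotate_about((138.0, 128.0), centre, 90.0))
```

Because image rows usually increase downwards, a positive yaw in this sketch appears clockwise on screen; a real pipeline would need to fix that sign convention against the robot frame.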