MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

Sirine Bhouri; Lan Wei; Jian-Qing Zheng; Dandan Zhang

TL;DR

MultiDiffSense tackles the data bottleneck in visuo-tactile robotics by introducing a diffusion-based, dual-conditioned model that jointly generates aligned images for ViTac, TacTip, and ViTacTip within a single architecture. It leverages pose-aligned depth maps as geometric conditioning and structured text prompts to specify sensor modality and 4-DoF contact pose, enabling controllable, physically consistent cross-modal synthesis. Empirical results show substantial improvements over a Pix2Pix baseline across seen/unseen objects and poses, and downstream pose estimation with mixed real-synthetic data approaches real-data performance, highlighting practical data-augmentation benefits. The work paves the way for scalable multi-modal tactile datasets, cross-sensor transfer, and flexible deployment in robotics, with future directions including broader object sets, richer geometry, and dynamic contact modeling.

Abstract

Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluating on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
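The structured prompt described above can be pictured as a small template that names the target sensor and the contact pose. The following is a minimal sketch only: the field names, units, ordering, the `ContactPose` type, and the `build_prompt` helper are all hypothetical and are not taken from the paper. The 4-DoF pose is assumed here to be (x, y, z, yaw).

```python
from dataclasses import dataclass

# The three sensor modalities named in the abstract.
SENSORS = ("ViTac", "TacTip", "ViTacTip")


@dataclass(frozen=True)
class ContactPose:
    """Hypothetical 4-DoF contact pose: assumed to be x, y, z and yaw."""
    x_mm: float
    y_mm: float
    z_mm: float
    yaw_deg: float


def build_prompt(sensor: str, obj: str, pose: ContactPose) -> str:
    """Format a structured text prompt (illustrative template, not the paper's)."""
    if sensor not in SENSORS:
        raise ValueError(f"unknown sensor: {sensor!r}")
    return (
        f"sensor: {sensor}; object: {obj}; "
        f"x: {pose.x_mm:.2f} mm; y: {pose.y_mm:.2f} mm; "
        f"z: {pose.z_mm:.2f} mm; yaw: {pose.yaw_deg:.1f} deg"
    )


if __name__ == "__main__":
    print(build_prompt("TacTip", "cube", ContactPose(1.0, -2.0, 0.5, 30.0)))
```

Keeping the sensor name as an explicit field is what would let a single model switch output modality from the prompt alone, while the depth map carries the geometry.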

Paper Structure (25 sections, 3 equations, 5 figures, 5 tables)

Introduction
Related Work
Single-Output Tactile Image Generation
Conditional GANs
Conditional Diffusion Models
Multi-Modal Tactile Image Generation
Methods
Preliminary
Model Architecture
Data Conditioning Pipeline
Control Image Generation
Textual Prompt Generation
Experiments and Results
Dataset Introduction
Experimental Setup
...and 10 more sections

Figures (5)

Figure 1: Framework Overview. The model takes a CAD file and textual prompt as inputs. The CAD model is converted into a pose-aligned depth map (control image) fed via zero-convolutions into the ControlNet branch as the geometric condition. The text prompt is encoded with CLIP and injected into the UNet via cross-attention. The decoder then refines the latents based on both conditions to generate an image reflecting the desired object geometry, contact pose, and sensor modality.
Figure 2: Control Image Processing Pipeline. The pipeline takes an STL file, a target image, and a CSV log of end-effector poses (pose annotations) as inputs and consists of four stages: (1) render a depth map from the STL file and preprocess it to extract clean object masks; (2) align robot coordinates to image pixels via centroid mapping; (3) scale XY translations using workspace calibration, incorporate Z-axis depth through geometric scaling and intensity modulation, and apply yaw rotation using 2D rotation matrices; (4) minimise the centre alignment error to $< 5$ pixels ($\approx 0.6$ mm).
Figure 3: Example of Structured Textual Prompt
Figure 4: Visualisation of image generation results on unseen objects across three tactile sensor modalities (ViTacTip, ViTac, TacTip). Red dashed boxes highlight regions where the methods differ: MultiDiffSense better preserves contact geometry, marker patterns, and lighting.
Figure 5: Effect of prompt length on reconstruction quality. Real vs. generated images from the two model variants under the two testing scenarios (seen objects with unseen poses, and unseen objects).
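The geometric part of the control-image pipeline in Figure 2 (centroid mapping, XY scaling, yaw rotation) can be sketched as a single 2D affine transform. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, robot X/Y axes are assumed to be aligned with image columns/rows, and the pixel scale is inferred from the caption's "5 pixels ≈ 0.6 mm". The Z-dependent geometric scaling is exposed only as a generic `scale` argument, and intensity modulation is omitted, because the caption does not give their formulas.

```python
import numpy as np

# Assumption: pixel scale inferred from the Figure 2 caption (5 px is about 0.6 mm).
PX_PER_MM = 5.0 / 0.6


def pose_to_affine(centroid_px, x_mm, y_mm, yaw_deg, scale=1.0, px_per_mm=PX_PER_MM):
    """Build a 2x3 affine matrix A such that p' = A @ [px, py, 1].

    The transform rotates (and optionally scales) points about the object
    centroid, then translates them by the XY offset converted to pixels:
        p' = s * R(yaw) @ (p - c) + c + t
    """
    c = np.asarray(centroid_px, dtype=float)
    t = np.array([x_mm, y_mm], dtype=float) * px_per_mm
    theta = np.deg2rad(yaw_deg)
    rot = scale * np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta),  np.cos(theta)],
    ])
    offset = c + t - rot @ c
    return np.hstack([rot, offset[:, None]])


def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (N, 2) array of pixel coordinates."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]


if __name__ == "__main__":
    A = pose_to_affine(centroid_px=(100, 100), x_mm=0.6, y_mm=0.0, yaw_deg=90.0)
    # The centroid shifts by 5 px along x; a point 10 px to its right
    # rotates to 10 px along +y (downwards in image coordinates).
    print(apply_affine(A, [[100, 100], [110, 100]]))
```

In a full pipeline the same matrix would be passed to an image-warping routine to move the rendered depth mask, after which the residual centroid offset could be checked against the 5-pixel tolerance.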
