Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting

Cagri Gungor, Derek Eppinger, Adriana Kovashka

TL;DR

This work addresses the generalization gap in tactile image generation caused by data leakage and reliance on reference-based metrics. It introduces a leakage-free evaluation protocol and four reference-free metrics (TMMD, I-TMMD, CI-TMMD, D-TMMD) built on a dedicated tactile encoder, enabling robust assessment of tactile fidelity and diversity. It also presents a text-conditioned latent diffusion model that uses concise material descriptions during training to guide vision-to-touch generation, producing tactile images from visual inputs at inference. Experiments on leakage-free TaG-NoLeak and HCT-NoLeak splits, with human evaluation, demonstrate improved fidelity, internal consistency, and class separation, highlighting the need for leakage-aware benchmarks and specialized tactile metrics for reliable generalization in multimodal sensing.

Abstract

Tactile sensing, which relies on direct physical contact, is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. However, ensuring robust generalization in synthesizing tactile images, which requires capturing subtle, material-specific contact features, remains challenging. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models. To address this, we propose a leakage-free evaluation protocol coupled with novel, reference-free metrics (TMMD, I-TMMD, CI-TMMD, and D-TMMD) tailored for tactile generation. Moreover, we propose a vision-to-touch generation method that leverages text as an intermediate modality by incorporating concise, material-specific descriptions during training to better capture essential tactile features. Experiments on two popular visuo-tactile datasets, Touch and Go and HCT, show that our approach achieves superior performance and enhanced generalization in a leakage-free setting.
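
The leakage problem described in the abstract can be pictured with a small sketch. This summary does not say how the TaG-NoLeak and HCT-NoLeak splits were actually built, so the code below only illustrates the general idea of a group-wise split: samples that share a source (here a hypothetical `recording_id`) are kept entirely in train or entirely in test, so near-identical frames cannot land on both sides. The function name `group_split` and the sample layout are assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_split(samples, test_fraction=0.2, seed=0):
    """Split (group_id, item) pairs so that no group_id spans train and test.

    `group_id` is a hypothetical stand-in for whatever identifies
    near-duplicate samples (e.g. a recording or object ID).
    """
    # Collect items per group so a group can be assigned as a whole.
    groups = defaultdict(list)
    for group_id, item in samples:
        groups[group_id].append(item)

    # Shuffle group IDs deterministically, then cut at the group level.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))

    test = [(g, x) for g in group_ids[:n_test] for x in groups[g]]
    train = [(g, x) for g in group_ids[n_test:] for x in groups[g]]
    return train, test


# Toy data: 100 frames drawn from 10 hypothetical recordings.
samples = [(f"recording_{i % 10}", f"frame_{i}") for i in range(100)]
train, test = group_split(samples)

train_groups = {g for g, _ in train}
test_groups = {g for g, _ in test}
```

A per-frame random split of the same toy data would almost certainly place frames from every recording in both sets, which is the kind of overlap the abstract says inflates performance metrics.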

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An illustration of how our reference-free metrics evaluate generated tactile images across different classes. Within each class (e.g., “Concrete” or “Grass”), samples are split into two subsets (A and B) to measure intra-class consistency, while inter-class divergence quantifies how distinct the generated samples are between classes.
  • Figure 2: Overview of our latent diffusion-based tactile image generation pipeline: visual inputs are enriched with text cues during training to guide the generation of tactile images via latent diffusion. At inference, only the visual image is required.
  • Figure 3: TMMD compares generated and reference tactile features via a dedicated tactile encoder, while I-TMMD provides a reference-free measure of internal consistency by comparing two disjoint subsets of generated samples.
  • Figure 4: Samples from both training and test sets of the Touch and Go, HCT and SSVTP datasets. The near-identical appearance of the vision-tactile pairs highlights the severe data leakage present in the original splits.
  • Figure 5: Comparison of Ours and Ours w/ BG under traditional pairwise metrics (SSIM, PSNR, LPIPS), which measure similarity to the Reference Tactile. Ours accurately captures material-specific features, such as line-shaped patterns for "grass" (green arrows), while Ours w/ BG uses the Background Tactile to incorporate irrelevant background details (blue boxes) and introduces errors, like pebbly patterns resembling "concrete" (red arrows). Ours w/ BG nevertheless scores higher, because these metrics reward irrelevant background details. This underscores their limitations, since Ours better prioritizes relevant material-specific details.
  • ...and 1 more figures
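
The captions of Figures 1 and 3 describe comparing sets of tactile features, either two disjoint subsets (A and B) of one class or samples from two different classes. A minimal sketch of that idea follows, assuming the metrics are based on a kernel maximum mean discrepancy (MMD), as their names suggest. The paper's actual encoder, kernel, and estimator are not given in this summary, so the RBF kernel, the bandwidth `sigma`, the biased estimator, and the random Gaussian vectors standing in for tactile-encoder features are all assumptions.

```python
import numpy as np


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2(x, y, sigma=4.0):
    """Biased estimate of the squared MMD between two feature sets."""
    return (
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )


def split_half_mmd2(features, rng, sigma=4.0):
    """Squared MMD between two disjoint random halves (A and B) of one set."""
    perm = rng.permutation(len(features))
    half = len(features) // 2
    subset_a = features[perm[:half]]
    subset_b = features[perm[half:2 * half]]
    return mmd2(subset_a, subset_b, sigma)


# Stand-in features: in practice these would come from a tactile encoder
# applied to generated images; here they are synthetic Gaussian vectors.
rng = np.random.default_rng(0)
concrete = rng.normal(0.0, 1.0, size=(200, 16))
grass = rng.normal(3.0, 1.0, size=(200, 16))

intra_concrete = split_half_mmd2(concrete, rng)  # internal consistency
inter_class = mmd2(concrete, grass)              # class separation
```

In this toy setup a low split-half value indicates that the two subsets of a class look alike in feature space, while a high cross-class value indicates that the classes are well separated. Neither quantity needs a paired reference tactile image.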