TextToucher: Fine-Grained Text-to-Touch Generation

Jiahang Tu, Hao Fu, Fengyu Yang, Hanbin Zhao, Chao Zhang, Hui Qian

TL;DR

TextToucher is a text-to-touch generation framework that synthesizes tactile images from fine-grained textual descriptions. It models object-level (texture, shape) and sensor-level (gel status) information and fuses the two in a diffusion transformer with dual-grain conditioning, significantly outperforming vision-conditioned baselines. The approach relies on annotations produced by a multimodal LLM, learnable gel-status prompts, and the CTTP contrastive metric, which measures how well generated tactile data aligns with its text prompt. The work targets cost-effective, high-fidelity tactile data synthesis for training and evaluating embodied AI and multimodal models.

Abstract

Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at the lowest possible cost, a series of studies have attempted to generate tactile images by vision-to-touch image translation. However, compared to the text modality, tactile generation driven by the visual modality cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail at two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status). We model both granularities through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we use a multimodal large language model to build text sentences describing object-level tactile information and employ a set of learnable text prompts to represent sensor-level tactile information. To better guide the tactile generation process with this text information, we fuse the two grains of text information and explore various dual-grain text conditioning methods within the diffusion transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method. The source code will be available at https://github.com/TtuHamg/TextToucher.
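
To make the idea of dual-grain conditioning concrete, the sketch below shows one possible way to fuse the two grains: object-level token embeddings are concatenated with a block of prompt vectors selected by gel status. This is only an illustration under stated assumptions. The abstract says several conditioning methods are explored and does not specify which one is used, so concatenation is an assumption here, and the names `EMBED_DIM`, `PROMPT_LEN`, `init_gel_prompts`, and `build_condition` are hypothetical. In a real model the prompt vectors would be trainable parameters and the object tokens would come from a text encoder.

```python
import random

EMBED_DIM = 4   # hypothetical embedding width
PROMPT_LEN = 2  # hypothetical number of prompt tokens per gel status


def init_gel_prompts(num_statuses, seed=0):
    """Create one block of prompt vectors per gel status.

    In a real model these would be trainable parameters; here they are
    plain lists initialised with small random values.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in range(PROMPT_LEN)]
        for _ in range(num_statuses)
    ]


def build_condition(object_tokens, gel_prompts, gel_id):
    """Concatenate object-level token embeddings with the chosen gel-status prompt."""
    for token in object_tokens:
        if len(token) != EMBED_DIM:
            raise ValueError("object token has the wrong embedding width")
    return list(object_tokens) + gel_prompts[gel_id]


gel_prompts = init_gel_prompts(num_statuses=3)
# Stand-in for the text-encoder output of an object-level description.
object_tokens = [[0.1] * EMBED_DIM for _ in range(5)]
condition = build_condition(object_tokens, gel_prompts, gel_id=1)
print(len(condition), len(condition[0]))
```

The resulting sequence (five object-level tokens followed by two gel-status tokens in this toy setup) is what a conditioning mechanism such as cross-attention could attend to.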

Paper Structure

This paper contains 36 sections, 7 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: We present tactile images captured by sensors under different gel statuses. In our opinion, each tactile image contains three types of information: tactile texture, tactile shape, and gel status.
  • Figure 2: Left: TextToucher uses the text modality to capture tactile texture, tactile shape, and gel status information. We employ LLaVA, a large vision-language model, to caption the shape information in tactile images. These captions are combined with texture descriptions from tactile datasets and encoded with a text encoder. Additionally, we define a set of special word tokens to represent gel status information. Right: We train a tactile encoder with a contrastive loss. In the shared space of the text and tactile modalities, we propose a metric called CTTP, which uses cosine similarity to measure the agreement between tactile images and text descriptions. The metric is designed to evaluate the quality of text-conditioned tactile image generation.
  • Figure 3: LLaVA engages in a step-by-step reasoning process based on carefully designed questions to achieve accurate data annotation.
  • Figure 4: We compare our approach with other representative methods. TextToucher can produce tactile images with fewer artifacts and higher quality, closely aligning with the provided text descriptions.
  • Figure 5: The first row displays the gel statuses contained in the HCT dataset. The remaining rows show the same object generated under different gel statuses.
  • ...and 5 more figures
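
The Figure 2 caption describes a tactile encoder trained with a contrastive loss and a CTTP metric based on cosine similarity in a shared text-touch space. A minimal sketch of those two pieces follows. It assumes a CLIP-style symmetric InfoNCE loss and a score that averages cosine similarity over matched pairs. The source does not give the exact loss, temperature, or aggregation, so all of these, along with the function names, are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def contrastive_loss(touch_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch where pair i is (touch_embs[i], text_embs[i]).

    The temperature value is an assumption, not taken from the paper.
    """
    n = len(touch_embs)
    logits = [
        [cosine_similarity(t, s) / temperature for s in text_embs]
        for t in touch_embs
    ]

    def cross_entropy(rows):
        # Cross-entropy with the matching pair (the diagonal) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_sum - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def cttp_style_score(touch_embs, text_embs):
    """Mean cosine similarity between each tactile embedding and its own text."""
    pairs = list(zip(touch_embs, text_embs))
    return sum(cosine_similarity(t, s) for t, s in pairs) / len(pairs)


touch = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
mismatched_text = [[0.0, 1.0], [1.0, 0.0]]
print(cttp_style_score(touch, matched_text), cttp_style_score(touch, mismatched_text))
```

With these toy embeddings, matched pairs score higher than mismatched ones and incur a lower contrastive loss, which is the behaviour such a metric is meant to reward.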