RA-Touch: Retrieval-Augmented Touch Understanding with Enriched Visual Data

Yoorhim Cho, Hongyeob Kim, Semin Kim, Youjia Zhang, Yunseok Choi, Sungeun Hong

TL;DR

RA-Touch addresses the scarcity of tactile data by leveraging tactile semantics learned from recaptioned visual data. It introduces ImageNet-T, a tactile-focused vision-language dataset, and two key modules: the Tactile-Guided Retriever, which generates tactile-aware queries from visual and tactile features to retrieve semantically aligned examples, and the Texture-Aware Integrator, which fuses retrieved cues to produce texture-grounded tactile descriptions. The framework is built on TVL-LLaMA and achieves state-of-the-art results on the TVL benchmark, demonstrating strong open-vocabulary tactile reasoning without direct tactile supervision. By showing that retrieval-augmented visual knowledge can ground tactile understanding, RA-Touch offers a scalable and data-efficient approach with potential impact on robotics and embodied AI.

Abstract

Visuo-tactile perception aims to understand an object's tactile properties, such as texture, softness, and rigidity. However, the field remains underexplored because collecting tactile data is costly and labor-intensive. We observe that visually distinct objects can exhibit similar surface textures or material properties. For example, a leather sofa and a leather jacket have different appearances but share similar tactile properties. This implies that tactile understanding can be guided by material cues in visual data, even without direct tactile supervision. In this paper, we introduce RA-Touch, a retrieval-augmented framework that improves visuo-tactile perception by leveraging visual data enriched with tactile semantics. We carefully recaption a large-scale visual dataset with tactile-focused descriptions, enabling the model to access tactile semantics typically absent from conventional visual datasets. A key challenge remains in effectively utilizing these tactile-aware external descriptions. RA-Touch addresses this by retrieving visual-textual representations aligned with tactile inputs and integrating them to focus on relevant textural and material properties. By outperforming prior methods on the TVL benchmark, our method demonstrates the potential of retrieval-based visual reuse for tactile understanding. Code is available at https://aim-skku.github.io/RA-Touch
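The retrieval step described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names (`build_query`, `retrieve_top_k`), the convex-combination fusion controlled by `alpha`, the toy 3-dimensional embeddings, and the captions are all assumptions made for illustration. The abstract only states that representations aligned with tactile inputs are retrieved, so cosine similarity is used here as a plausible stand-in for that alignment.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def build_query(visual_emb, tactile_emb, alpha=0.5):
    """Hypothetical query fusion: a convex combination of the normalized
    visual and tactile embeddings. The paper's learned retriever is not
    reproduced here; this is an assumed stand-in."""
    v = l2_normalize(visual_emb)
    t = l2_normalize(tactile_emb)
    return [alpha * vi + (1.0 - alpha) * ti for vi, ti in zip(v, t)]


def retrieve_top_k(query, bank, k=2):
    """Return the k (score, caption) pairs most similar to the query."""
    scored = [(cosine(query, emb), caption) for emb, caption in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy stand-in for a recaptioned bank of (embedding, tactile caption) pairs.
bank = [
    ([1.0, 0.0, 0.0], "smooth, supple leather"),
    ([0.9, 0.1, 0.0], "soft, slightly grained leather"),
    ([0.0, 0.0, 1.0], "cold, rigid metal"),
]

query = build_query(visual_emb=[1.0, 0.0, 0.0], tactile_emb=[0.8, 0.2, 0.0])
top = retrieve_top_k(query, bank, k=2)
for score, caption in top:
    print(f"{score:.3f}  {caption}")
```

In this toy bank the two leather entries outrank the metal entry, mirroring the leather sofa and leather jacket example: the query is matched on material cues rather than on object identity.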

Paper Structure

This paper contains 34 sections, 7 equations, 26 figures, 10 tables.

Figures (26)

  • Figure 1: RA-Touch motivation. Objects with different appearances can share similar tactile properties. RA-Touch leverages this observation by retrieving texture-relevant examples from ImageNet-T, which recaptions existing visual data with tactile-focused descriptions. This enables tactile inference without collecting additional tactile data, even when conventional VLMs fail to provide meaningful responses.
  • Figure 2: Overview of RA-Touch. We first construct ImageNet-T, a vision-language dataset recaptioned with tactile-focused descriptions using VLMs conditioned on the image, class name, and visual caption. Given RGB and tactile inputs, the Tactile-Guided Retriever selects the top-$K$ relevant samples from ImageNet-T based on visuo-tactile similarity. These samples are processed by the Texture-Aware Integrator, which extracts texture-relevant cues and combines them with the input tactile embedding to produce an augmented representation. This is fused with the original visual prompt to form a retrieval-augmented prompt for LLaMA, enabling tactile description generation in a parameter-efficient manner.
  • Figure 3: Performance comparisons across different subset sizes of ImageNet-T (10k, 50k, 100k, 150k) on three datasets: SSVTP, HCT, and TVL.
  • Figure 4: Retrieval results with visual or tactile features. (a) Image-to-Image retrieves polished surface objects but lacks physical texture. (b) Tactile-to-Text focuses on text alone, retrieving a drawing of an abacus as Top-1.
  • Figure 5: Example of retrieval samples from (a) SSVTP and (b) HCT with given inputs. The red bounding box indicates the region of contact sensed by the tactile sensor. Although five samples were retrieved, only three are shown for clarity.
  • ...and 21 more figures
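The Figure 2 caption says the Texture-Aware Integrator extracts texture-relevant cues from the retrieved samples and combines them with the input tactile embedding. One way to picture this is attention-style pooling followed by a residual sum, sketched below. This is an assumption, not the module described in the paper: the dot-product scoring, the `temperature` parameter, the residual addition, and the function name `integrate` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def integrate(tactile_emb, retrieved_embs, temperature=1.0):
    """Hypothetical integrator: weight each retrieved embedding by its
    similarity to the tactile embedding, pool them, and add the pooled
    vector back onto the tactile embedding (residual sum)."""
    scores = [dot(tactile_emb, r) / temperature for r in retrieved_embs]
    weights = softmax(scores)
    dim = len(tactile_emb)
    pooled = [
        sum(w * r[i] for w, r in zip(weights, retrieved_embs))
        for i in range(dim)
    ]
    augmented = [t + p for t, p in zip(tactile_emb, pooled)]
    return augmented, weights


# Toy inputs: one retrieved sample matches the tactile embedding, one does not.
tactile = [1.0, 0.0]
retrieved = [[1.0, 0.0], [0.0, 1.0]]
augmented, weights = integrate(tactile, retrieved)
print(weights)
print(augmented)
```

The retrieved sample that agrees with the tactile embedding receives the larger weight, so it dominates the augmented representation. In the framework as described, that augmented representation is then fused with the visual prompt for LLaMA, a step this sketch does not cover.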