Visuo-Tactile Zero-Shot Object Recognition with Vision-Language Model

Shiori Ueda, Atsushi Hashimoto, Masashi Hamaya, Kazutoshi Tanaka, Hideo Saito

TL;DR

This work incorporates tactile data into a Vision-Language Model (VLM) for visuo-tactile zero-shot object recognition. The approach leverages the zero-shot capability of VLMs to infer tactile properties from the names of tactilely similar objects.

Abstract

Tactile perception is vital, especially when distinguishing visually similar objects. We propose an approach to incorporate tactile data into a Vision-Language Model (VLM) for visuo-tactile zero-shot object recognition. Our approach leverages the zero-shot capability of VLMs to infer tactile properties from the names of tactilely similar objects. The proposed method translates tactile data into a textual description solely by annotating object names for each tactile sequence during training, making it adaptable to various contexts with low training costs. The proposed method was evaluated on the FoodReplica and Cube datasets, demonstrating its effectiveness in recognizing objects that are difficult to distinguish by vision alone.

Paper Structure (21 sections, 1 equation, 9 figures, 1 table)


Figures (9)

  • Figure 1: Task overview. We address the problem of recognizing objects that are difficult to distinguish by vision alone. To facilitate recognition, the robot collects tactile information in addition to visual observation. Tactile signals are converted into a text description (e.g., "similar to ${known reference object name(s)}") and fed to the VLM. Based on the image, the textual tactile description, and common sense, the VLM identifies the object class in a zero-shot manner.
  • Figure 2: The proposed pipeline. The tactile embedding network learns the tactile embedding from the tactile sequence. These embeddings are converted to textual descriptions using the tactile-to-text database. During inference, the Vision-Language Model (VLM) receives a textual description along with the visual image and outputs the most likely class label for the input object in a zero-shot manner.
  • Figure 3: Prompt for visuo-tactile zero-shot object recognition. ${topk_refs} represents the reference classes of the top-$k$ nearest tactile embeddings. ${test_time_classes} denotes the set of labels of the test dataset.
  • Figure 4: Snapshots of the TactileReference dataset. It is used in two modules. In the tactile embedding network module, the dataset is split into 27 classes for training and 5 classes for validation. In the tactile-to-text database module, all classes of the dataset are used to construct the tactile-to-text database.
  • Figure 5: The FoodReplica dataset and the results. Class labels of replicas are given in the resin_replica_${name} format. The vision-only method predicted replicas as real in most cases, while the proposed (vision + tactile) method achieved a balanced performance.
  • ...and 4 more figures
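
The captions for Figures 2 and 3 describe a retrieval step (find the top-k nearest tactile embeddings, read off their reference class names) followed by prompt construction. The sketch below illustrates that flow under several assumptions that are not taken from the paper: cosine similarity as the nearest-neighbor metric, two-dimensional toy embeddings, hypothetical reference names such as "sponge", and an invented prompt wording. Only the placeholder names `${topk_refs}` and `${test_time_classes}` and the `resin_replica_${name}` label format come from the captions.

```python
import math
from string import Template


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topk_reference_names(query, database, k):
    """Return the distinct class names of the k entries most similar to `query`.

    `database` is a list of (embedding, class_name) pairs, standing in for
    the tactile-to-text database.
    """
    ranked = sorted(
        database,
        key=lambda entry: cosine_similarity(query, entry[0]),
        reverse=True,
    )
    names = []
    for _, name in ranked[:k]:
        if name not in names:  # keep each reference name once, in rank order
            names.append(name)
    return names


# Hypothetical prompt wording; only the two placeholder names are from Figure 3.
PROMPT = Template(
    "The object in the image feels similar to: ${topk_refs}. "
    "Choose the most likely class from: ${test_time_classes}."
)


def build_prompt(query, database, test_time_classes, k=2):
    """Fill the prompt template with retrieved reference names and test labels."""
    refs = topk_reference_names(query, database, k)
    return PROMPT.substitute(
        topk_refs=", ".join(refs),
        test_time_classes=", ".join(test_time_classes),
    )


# Toy data: embeddings and reference names are made up for illustration.
database = [
    ([1.0, 0.0], "sponge"),
    ([0.9, 0.1], "marshmallow"),
    ([0.0, 1.0], "steel ball"),
    ([0.1, 0.9], "glass marble"),
]
test_time_classes = ["apple", "resin_replica_apple"]

prompt = build_prompt([0.95, 0.05], database, test_time_classes, k=2)
print(prompt)
```

In this sketch the resulting text would be sent to the VLM together with the image. Keeping retrieval and prompt formatting as separate functions makes it easy to swap in a different similarity metric or a different value of k.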