Octopi: Object Property Reasoning with Large Tactile-Language Models

Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, Harold Soh

TL;DR

This work addresses the gap in physical reasoning for robots by integrating tactile sensing with large vision-language models. It introduces PhysiCLeAR, a GelSight-based tactile dataset with hardness, roughness, and bumpiness annotations, and five reasoning tasks, and presents Octopi, a tactile-grounded LVLM that learns tactile representations through a CLIP-based encoder and aligns them with a Vicuna/LLaMA LLM via a three-stage training pipeline (encoder fine-tuning, tactile feature alignment, and end-to-end fine-tuning with LoRA). Experiments show that object-property descriptions grounded in tactile data improve reasoning and task performance, including avocado ripeness classification, and ablations demonstrate the value of a fine-tuned visual encoder and parameter-efficient end-to-end training. The results highlight the potential of tactile-language grounding to improve robustness under visual ambiguity and to expand the capabilities of embodied AI systems. The work suggests future directions in richer tactile modalities and cross-modal alignment for broader real-world robotics applications.

Abstract

Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.

Paper Structure (30 sections, 8 figures, 22 tables)

Figures (8)

  • Figure 1: Avocado ripeness selection by combining tactile information with commonsense knowledge. Using inputs from its tactile sensor, Octopi identifies the left avocado as softer. Using commonsense reasoning, Octopi infers that it is ripe and fulfils the user's request.
  • Figure 2: PhysiCLeAR and Octopi (with key contributions starred). We collect tactile videos for everyday household objects by hand with two exploratory procedures: pressing and rotation. The videos are annotated by three annotators for three physical properties: hardness, roughness and bumpiness. PhysiCLeAR leverages the videos and annotations for five language-driven physical description and understanding tasks. Octopi is an LVLM fine-tuned on PhysiCLeAR for tactile-grounded physical understanding and reasoning.
  • Figure 3: Octopi Framework. Our framework consists of CLIP's visual encoder, a projection module with two linear layers, and Vicuna v1.5 as the LLM. Language embeddings are derived through tokenization and then Vicuna's word embedding layer, with <tact_start> and <tact_end> being newly trained word embeddings indicating the start and end of a tactile frame sequence from a single tactile sensor. Tactile frames are fed into the visual encoder followed by the projection module to derive tactile embeddings with the same dimension as the word embeddings.
  • Figure 4: Rice (Cooked vs. Uncooked) Reasoning. Octopi-13b is prompted to reason about whether a scoop of rice is more likely to be cooked or uncooked based on a tactile video of a scoop of uncooked rice. It reasons about the rice state correctly without being trained to do so.
  • Figure 5: Toothbrush Part Reasoning. Given a tactile video of a toothbrush's handle and the same toothbrush's bristles, Octopi-13b is prompted to reason which tactile readings belong to the handle and which belong to the bristles.
  • ...and 3 more figures
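
The data flow described in the Figure 3 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the dimensions (`ENC_DIM`, `HIDDEN_DIM`, `WORD_DIM`), the helper names, and the random stand-ins for encoder outputs and word embeddings are all assumptions made for the example. The caption does not say whether an activation sits between the two linear layers, so none is included here.

```python
import random

# Hypothetical sizes for illustration only; the real encoder and LLM
# embedding dimensions are much larger and are not given in this section.
ENC_DIM, HIDDEN_DIM, WORD_DIM = 6, 8, 4


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a randomly initialised linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply y = W x + b to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def project(frame_features, layer1, layer2):
    """Map per-frame encoder features into the word-embedding dimension."""
    return [linear(layer2, linear(layer1, f)) for f in frame_features]


def build_sequence(prefix, tactile, suffix, tact_start, tact_end):
    """Wrap tactile embeddings in <tact_start>/<tact_end> inside the prompt."""
    return prefix + [tact_start] + tactile + [tact_end] + suffix


rng = random.Random(0)
layer1 = make_linear(ENC_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, WORD_DIM, rng)

# Stand-ins for the visual encoder's output on three tactile frames.
frames = [[rng.random() for _ in range(ENC_DIM)] for _ in range(3)]

# Stand-ins for word embeddings of the text before and after the video.
prefix = [[0.0] * WORD_DIM for _ in range(2)]
suffix = [[0.0] * WORD_DIM for _ in range(1)]

# Stand-ins for the newly trained <tact_start> and <tact_end> embeddings.
tact_start = [1.0] * WORD_DIM
tact_end = [-1.0] * WORD_DIM

tactile = project(frames, layer1, layer2)
sequence = build_sequence(prefix, tactile, suffix, tact_start, tact_end)
print(len(sequence), len(sequence[0]))
```

The point of the sketch is the shape contract: after projection, every tactile frame embedding has the same length as a word embedding, so the tactile span can be spliced directly into the token sequence that the LLM consumes.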