Octopi: Object Property Reasoning with Large Tactile-Language Models
Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, Harold Soh
TL;DR
This work addresses a gap in robot physical reasoning by integrating tactile sensing with large vision-language models (LVLMs). It introduces PhysiCLeAR, a GelSight-based tactile dataset annotated for hardness, roughness, and bumpiness and paired with five reasoning tasks. It also presents Octopi, a tactile-grounded LVLM that learns tactile representations through a CLIP-based encoder and aligns them with a Vicuna/LLaMA LLM via a three-stage training pipeline: encoder fine-tuning, tactile feature alignment, and end-to-end fine-tuning with LoRA. Experiments show that object-property descriptions grounded in tactile data improve reasoning and task performance, including avocado ripeness classification. Ablations demonstrate the value of a fine-tuned visual encoder and of parameter-efficient end-to-end training. The results highlight the potential of tactile-language grounding to improve robustness under visual ambiguity and to expand the capabilities of embodied AI systems, and the work suggests future directions in richer tactile modalities and cross-modal alignment for broader real-world robotics applications.
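The three-stage pipeline named in the TL;DR can be pictured as a schedule of which components receive gradient updates in each stage. The sketch below is illustrative only and is not taken from the Octopi codebase. The component names (`tactile_encoder`, `projection`, `llm_base`, `llm_lora`) are hypothetical, and the assignment of trainable parts per stage is an assumption inferred from the stage names. The one non-assumed point is that LoRA, by design, leaves the base LLM weights frozen and trains only small adapter matrices.

```python
from dataclasses import dataclass

# Hypothetical component names, chosen for illustration only.
COMPONENTS = ("tactile_encoder", "projection", "llm_base", "llm_lora")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components assumed to be updated in this stage


# Assumed schedule: the stage names come from the TL;DR, but the
# trainable sets below are a guess, not a confirmed detail.
STAGES = (
    Stage("encoder_fine_tuning", frozenset({"tactile_encoder"})),
    Stage("tactile_feature_alignment", frozenset({"projection"})),
    Stage("end_to_end_fine_tuning", frozenset({"projection", "llm_lora"})),
)


def requires_grad_flags(stage):
    """Map every component to True (trainable) or False (frozen)."""
    return {component: component in stage.trainable for component in COMPONENTS}


def describe(stage):
    """Return a one-line summary of the trainable and frozen parts."""
    flags = requires_grad_flags(stage)
    trainable = [c for c in COMPONENTS if flags[c]]
    frozen = [c for c in COMPONENTS if not flags[c]]
    return f"{stage.name}: train={trainable} freeze={frozen}"


if __name__ == "__main__":
    for stage in STAGES:
        print(describe(stage))
```

In a real training loop, flags like these would typically be applied by toggling `requires_grad` on each module's parameters before building the optimizer for that stage.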
Abstract
Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset, PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.
