Universal Visuo-Tactile Video Understanding for Embodied Interaction

Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding

TL;DR

Vision-language models lack tactile grounding for embodied interaction. This work addresses that gap with VTV-LLM, a visuo-tactile video large language model, and VTV150K, a cross-sensor dataset of 150k visuo-tactile video frames from 100 objects captured with GelSight Mini, DIGIT, and Tac3D sensors, annotated with four tactile attributes (hardness, protrusion, elasticity, and friction) and paired with 10k template-generated question-answer pairs. A three-stage training paradigm (VTV enhancement with optical-flow-guided masking, VTV-text alignment, and text prompt finetuning) bridges tactile perception and language: video features $F_{VTV}$ are mapped to embeddings $E_V$, which are concatenated with text embeddings $E_T$ to produce the output $A = f_{LLM}(\text{Concat}([E_V; E_T]))$. Experimental results show that VTV-LLM achieves superior tactile understanding across feature assessment, surface feature distinction, and scenario analysis, enabling more intuitive and capable embodied interaction in tactile domains.
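The data flow behind $A = f_{LLM}(\text{Concat}([E_V; E_T]))$ can be illustrated with a small sketch. It assumes the mapping from $F_{VTV}$ to $E_V$ is a single linear projection and that concatenation happens along the token axis; the summary specifies neither, and the names `project`, `answer`, and `weight` are hypothetical. The language model is passed in as a plain callable rather than a real LLM.

```python
def project(features, weight):
    """Map each feature row (length d_in) to an embedding row (length d_model).

    Assumption: a single linear projection; the real mapping may differ.
    """
    columns = list(zip(*weight))  # weight is d_in x d_model
    return [
        [sum(f * w for f, w in zip(row, col)) for col in columns]
        for row in features
    ]


def answer(f_vtv, weight, e_t, f_llm):
    """Compute A = f_LLM(Concat([E_V; E_T])) for toy list-of-list inputs."""
    e_v = project(f_vtv, weight)  # F_VTV -> E_V
    if e_v and e_t and len(e_v[0]) != len(e_t[0]):
        raise ValueError("E_V and E_T must share the same embedding width")
    sequence = e_v + e_t  # concatenate along the token axis
    return f_llm(sequence)


if __name__ == "__main__":
    f_vtv = [[1.0, 2.0], [3.0, 4.0]]             # two video tokens, d_in = 2
    weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # d_in x d_model = 2 x 3
    e_t = [[0.0, 0.0, 0.0]]                      # one text token, d_model = 3
    # Stand-in for the LLM: report how many tokens it received.
    print(answer(f_vtv, weight, e_t, f_llm=len))
```

The width check reflects the one constraint the formula implies: video and text embeddings must live in the same space before they can be concatenated into one sequence.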

Abstract

Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate the tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities, including feature assessment, comparative analysis, and scenario-based decision making. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The workflow consists of four key components: (a) Data Collection, which includes 100 diverse objects recorded by 3 different tactile sensors, resulting in 150,000 video frames; (b) Attribute Annotation, where objects are systematically categorized across 4 static and dynamic tactile attributes: hardness, protrusion, elasticity, and friction; (c) Template Generation, which generates 10,000 question-answer pairs using structured templates for various comparative analyses; and (d) Embodied Interaction, demonstrating VTV-LLM's capability to perform tasks such as tactile feature assessment, surface feature distinction, and tactile scenario analysis. Through this integrated approach, VTV-LLM enables multi-modal reasoning about physical attributes that cannot be determined through visual inspection alone, creating a foundation for more sophisticated human-machine interaction in tactile understanding domains.
  • Figure 2: (a) VTV-LLM framework: A multi-modal system integrating visuo-tactile video data with large language models to facilitate tactile reasoning for embodied interaction; (b) Multi-Stage Training: The training consists of VTV enhancement, alignment between visuo-tactile video and text, and prompt-based finetuning to generate accurate tactile descriptions.
  • Figure 3: Training pipeline of VTV enhancement.
  • Figure 4: Several task examples from the proposed VTV150K along with predictions from VTV-LLM.
  • Figure 5: Performance comparison of VTV-LLM on the different parameters.
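
Figure 1(c) describes generating question-answer pairs from structured templates for comparative analyses. The sketch below shows one way such template filling could work. The object names, the ordinal 0 to 2 attribute levels, the template wording, and the function `comparative_qa` are all hypothetical; the summary does not describe the actual annotation scale or templates.

```python
from itertools import combinations

# The four tactile attributes named in the paper.
ATTRIBUTES = ("hardness", "protrusion", "elasticity", "friction")

# Hypothetical template; the paper's real templates are not shown in this summary.
TEMPLATE = "Which object has higher {attr}: {a} or {b}?"

# Hypothetical annotations on an assumed ordinal scale (0 = low, 2 = high).
EXAMPLE_ANNOTATIONS = {
    "sponge": {"hardness": 0, "protrusion": 0, "elasticity": 2, "friction": 1},
    "steel_ruler": {"hardness": 2, "protrusion": 0, "elasticity": 0, "friction": 0},
    "rubber_ball": {"hardness": 1, "protrusion": 1, "elasticity": 2, "friction": 2},
}


def comparative_qa(annotations):
    """Fill the template for every object pair and every attribute."""
    qa_pairs = []
    for a, b in combinations(sorted(annotations), 2):
        for attr in ATTRIBUTES:
            level_a, level_b = annotations[a][attr], annotations[b][attr]
            if level_a == level_b:
                answer = f"They have similar {attr}."
            else:
                answer = a if level_a > level_b else b
            qa_pairs.append(
                {"question": TEMPLATE.format(attr=attr, a=a, b=b), "answer": answer}
            )
    return qa_pairs


if __name__ == "__main__":
    for qa in comparative_qa(EXAMPLE_ANNOTATIONS)[:4]:
        print(qa["question"], "->", qa["answer"])
```

Because the number of pairs grows quadratically with the number of objects, even a modest set of templates over 100 annotated objects can yield thousands of comparative questions.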