TactEx: An Explainable Multimodal Robotic Interaction Framework for Human-Like Touch and Hardness Estimation

Felix Verstraete, Lan Wei, Wen Fan, Dandan Zhang

TL;DR

TactEx is an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. It achieves 90% task success on simple user queries and generalises to novel tasks without large-scale tuning.

Abstract

Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface allows users to distinguish ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
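The data flow described in the abstract (localise objects visually, touch each one to estimate hardness, then let a language model compose the answer) can be sketched as a chain of three stages. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Touch` record, the `run_query` function, and all three stand-in callables are hypothetical names, and the hardness numbers are placeholder values in arbitrary units rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int]


@dataclass
class Touch:
    """One probed object: its label, contact pixel, and estimated hardness."""
    name: str
    contact_xy: Pixel
    hardness: float


def run_query(
    query: str,
    localise: Callable[[str], Dict[str, Pixel]],
    estimate_hardness: Callable[[Pixel], float],
    explain: Callable[[str, List[Touch]], str],
) -> str:
    """Chain the three stages: localise objects, touch each one, explain."""
    targets = localise(query)  # vision stage (YOLO or GSAM in the paper)
    touches = [
        Touch(name, xy, estimate_hardness(xy))  # tactile stage
        for name, xy in targets.items()
    ]
    return explain(query, touches)  # language stage


# Hypothetical stand-ins so the sketch runs without a robot or trained models.
def fake_localise(query: str) -> Dict[str, Pixel]:
    return {"banana": (120, 80), "avocado": (300, 210)}


def fake_hardness(xy: Pixel) -> float:
    # Placeholder values in arbitrary units, not measurements from the paper.
    return {(120, 80): 35.0, (300, 210): 62.5}[xy]


def fake_explain(query: str, touches: List[Touch]) -> str:
    softest = min(touches, key=lambda t: t.hardness)
    return f"The {softest.name} is the softest (hardness {softest.hardness:.1f})."


answer = run_query(
    "Which fruit is softest?", fake_localise, fake_hardness, fake_explain
)
print(answer)  # The banana is the softest (hardness 35.0).
```

Passing the stages in as callables keeps the sketch independent of any particular detector, tactile model, or LLM, which mirrors the abstract's comparison of interchangeable localisers (YOLO versus GSAM).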

Paper Structure (37 sections, 1 equation, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Overview of TactEx (“The Tactile Explainer”), a multimodal framework for fruit ripeness explanation. Users interact via a chat interface (A), objects are localized with YOLO or GSAM (B1), hardness is estimated with a GelSight sensor (B2), and an LLM composes the final response from the fruit names, locations and hardness values (B3–C). The three components are detailed in Sections \ref{sec:serv}, \ref{sec:tac} and \ref{sec:llm}, respectively.
  • Figure 2: Example of the Grounded SAM procedure: (a) original scene, (b) object detection with bounding box, (c) SAM result with the inner mask used for computing the centroid.
  • Figure 3: Data collection: images were compared to a reference image. If the contact criteria were met, 8 images were captured and transformed into a 2- or 4-image sequence.
  • Figure 4: Results of tactile predictions from the main ResNet50-LSTM3 model after pretraining (a) and fine-tuning (b). This is the model that will eventually be implemented within TactEx.
  • Figure 5: Success rates of the TactEx framework across four interaction scenarios of increasing complexity (Sc1–Sc4), as defined in Table \ref{tab:complexity}. SL-SR: Scenario-Level Success Rate, OL-SR: Object-Level Success Rate, w/o: without, Sc: Scenario.
  • ...and 1 more figure
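
The Figure 2 caption mentions computing a centroid from an inner mask of the SAM segmentation. A minimal sketch of that idea follows, assuming the inner mask is obtained by a single 4-neighbour erosion of the binary segmentation mask; the caption does not say how the inner mask is actually derived, so the `erode` step, the function names, and the toy 5×5 mask are all illustrative assumptions.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def erode(mask: Mask) -> Mask:
    """Keep a pixel only if it and its four neighbours are all foreground.

    Pixels on the image border are always dropped (assumed choice).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            if (mask[r][c] and mask[r - 1][c] and mask[r + 1][c]
                    and mask[r][c - 1] and mask[r][c + 1]):
                out[r][c] = 1
    return out


def centroid(mask: Mask) -> Tuple[float, float]:
    """Return the mean (row, col) of all foreground pixels."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)


# Toy mask: a 3x3 object inside a 5x5 image.
segmentation = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

inner = erode(segmentation)
contact_point = centroid(inner)
print(contact_point)  # (2.0, 2.0)
```

Shrinking the mask before taking the centroid pulls the candidate contact point away from the object boundary, which is one plausible reason to use an inner mask for contact-site selection.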