VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Junyi Zong, Qingxuan Jia, Meixian Shi, Tong Li, Jiayuan Li, Zihang Lv, Gang Chen, Fang Deng

Abstract

Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
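
The vision-touch coupling mentioned in the abstract is a contrastive objective between paired embeddings of the two modalities. As a rough illustration only, the PyTorch sketch below implements a symmetric InfoNCE loss over a batch of paired vision/tactile embeddings; the function name, the symmetric two-direction formulation, and the temperature value are our assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(vision_emb: torch.Tensor,
                 tactile_emb: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired vision/tactile embeddings.

    Both inputs are (batch, dim); paired samples share a row index.
    The temperature value here is an assumption, not taken from the paper.
    """
    v = F.normalize(vision_emb, dim=-1)
    t = F.normalize(tactile_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: vision -> touch and touch -> vision,
    # treating the other in-batch samples as negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```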

Paper Structure

This paper contains 36 sections, 23 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Overview of VitaTouch. Left: three-stage training pipeline. Stage 1 performs cross-modal alignment via dual Q-Formers with InfoNCE and PTM losses. Stage 2 builds a property-reasoning multimodal model with fused vision-tactile (V–T) tokens in a frozen Vicuna-7B. LoRA-based defect adaptation is then conducted over progressively finer-grained defect label spaces using few-shot labeled samples per category (a hypothetical adapter-configuration sketch follows the figure list). Right: tactile sensing complements vision for property reasoning and defect recognition under visually ambiguous conditions.
  • Figure 2: VitaSet overview (our collection plus the GelSight-only portion of AnyTouch). Aligned RGB observations and paired GelSight tactile readings across objects, with controlled-vocabulary annotations and dataset statistics under a unified schema.
  • Figure 3: VitaTouch model architecture. VitaTouch employs a dual-branch vision-tactile design with modality-specific encoders and Q-Formers. The Q-Formers distill learnable queries into vision-tactile prefix tokens, which are prepended to text embeddings and fed into a frozen Vicuna-7B decoder (a minimal sketch of this prefix-token pathway follows the figure list). Training proceeds in three stages: Stage 1 aligns cross-modal embeddings via frozen encoders; Stage 2 establishes the perception-to-language pathway for property reasoning; Stage 3 adapts the model for few-shot defect recognition using LoRA while freezing the backbone.
  • Figure 4: VitaSet validation performance of VitaTouch across training epochs. (a) Multi-task validation trends for hardness accuracy, roughness accuracy, descriptor recall, and mean task score. (b) Comparison of strict exact-match descriptor recall and semantic similarity for the material-property descriptor task. Stars (★) mark the best epoch for each metric.
  • Figure 5: Ablation results on the VitaSet dataset across tasks. Each variant removes one key stage from the full model, demonstrating the necessity of explicit alignment and multimodal fusion for robust multi-task property learning.
  • ...and 1 more figure
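
Figure 3 describes a prefix-token pathway: each Q-Former's query outputs are projected into the language model's embedding space and prepended to the text embeddings before the frozen decoder runs. The PyTorch sketch below is a minimal, hypothetical rendering of that pathway; the class and parameter names, the per-modality linear projections, and the token ordering are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class PrefixFusion(nn.Module):
    """Hypothetical sketch of the prefix-token pathway described in Figure 3.

    Q-Former query outputs (one set per modality) are projected to the
    decoder's hidden size and prepended to the text embeddings before
    the frozen language model consumes the sequence.
    """
    def __init__(self, query_dim: int, llm_dim: int):
        super().__init__()
        self.vision_proj = nn.Linear(query_dim, llm_dim)
        self.tactile_proj = nn.Linear(query_dim, llm_dim)

    def forward(self,
                vision_queries: torch.Tensor,   # (B, Nv, query_dim)
                tactile_queries: torch.Tensor,  # (B, Nt, query_dim)
                text_embeds: torch.Tensor       # (B, L, llm_dim)
                ) -> torch.Tensor:
        prefix = torch.cat([self.vision_proj(vision_queries),
                            self.tactile_proj(tactile_queries)], dim=1)
        # The frozen decoder (e.g. Vicuna-7B) then runs on [prefix; text].
        return torch.cat([prefix, text_embeds], dim=1)
```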
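
For the LoRA-based defect adaptation named in Figure 1 and the abstract, the Hugging Face PEFT sketch below shows what a Stage-3 setup could look like: the backbone stays frozen and only injected low-rank matrices train. The checkpoint id and every hyperparameter shown are illustrative assumptions, not values reported by the authors.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical Stage-3 setup: the checkpoint id and all hyperparameters
# below are illustrative assumptions, not values from the paper.
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

# The frozen decoder keeps its weights; only the injected low-rank
# matrices receive gradients during few-shot defect fine-tuning.
model = get_peft_model(llm, lora_cfg)
model.print_trainable_parameters()
```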