FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation

Wenxuan Ma, Chaofan Zhang, Yinghao Cai, Guocai Yao, Shaowei Cui, Shuo Wang

TL;DR

This work proposes FG-CLTP, a fine-grained contrastive language tactile pretraining framework, and develops a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control.

Abstract

Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
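
The abstract's "discretized numerical tokenization" can be pictured as binning a continuous contact quantity and emitting one vocabulary token per bin. The sketch below is only an illustration of that general idea, not the authors' implementation: the `NumericBinner` class, the value ranges, the bin counts, the `<force_06>`-style token names, and the caption template in `describe` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericBinner:
    """Maps a continuous value to one of `num_bins` discrete tokens (hypothetical scheme)."""

    name: str      # token prefix, e.g. "force"
    low: float     # lower bound of the assumed value range
    high: float    # upper bound of the assumed value range
    num_bins: int  # number of discrete tokens covering the range

    def bin_index(self, value: float) -> int:
        # Clamp to the range, then pick a uniform-width bin.
        clamped = min(max(value, self.low), self.high)
        width = (self.high - self.low) / self.num_bins
        idx = int((clamped - self.low) / width)
        return min(idx, self.num_bins - 1)  # value == high falls in the last bin

    def token(self, value: float) -> str:
        # Token text such as "<force_06>"; the naming is an assumption.
        return f"<{self.name}_{self.bin_index(value):02d}>"

    def bin_center(self, idx: int) -> float:
        # Representative value of a bin, useful for decoding a token back to a number.
        width = (self.high - self.low) / self.num_bins
        return self.low + (idx + 0.5) * width


# Assumed ranges and resolutions, chosen only for illustration.
force = NumericBinner("force", low=0.0, high=10.0, num_bins=20)   # newtons
angle = NumericBinner("angle", low=0.0, high=180.0, num_bins=36)  # degrees


def describe(force_n: float, angle_deg: float) -> str:
    # Hypothetical caption template mixing ordinary words with numeric tokens.
    return f"contact force {force.token(force_n)} principal axis {angle.token(angle_deg)}"
```

Under these assumed settings, `describe(3.2, 45.0)` returns `contact force <force_06> principal axis <angle_09>`. Each such token would be added to the text vocabulary, so that nearby physical values map to the same or adjacent tokens rather than to arbitrary digit strings.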

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the FG-CLTP Framework. The fine-grained contrastive language-tactile pretraining (FG-CLTP) method aligns 3D tactile point clouds with quantitative contact-state-aware language descriptions. The pretrained encoder is integrated into a flow matching policy (3D-TLA), enabling tactile-based multimodal reasoning and action generation.
  • Figure 2: Overview of the Contact3D Dataset. The Contact3D dataset integrates real-world and simulated multi-sensor data, implementing automated collection through pressing, sliding, and rotating primitives, alongside comprehensive contact state annotations for holistic representation.
  • Figure 3: Architecture of the FG-CLTP. Tactile point clouds, tactile images, and language descriptions are processed through their respective encoders. For text tokenization, discrete numerical tokens are introduced for fine-grained alignment. The original tokens are frozen, while the newly added tokens are learnable. Contrastive learning is employed for semantic alignment within the feature space. Additionally, an explicit physical attribute regression loss is incorporated to enhance the discriminability of specific physical properties.
  • Figure 4: Contact State Classification Results.
  • Figure 5: A case study of generating text descriptions from realistic tactile point clouds.
  • ...and 1 more figure
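
The Figure 3 caption combines a contrastive alignment objective with an explicit physical attribute regression loss. The sketch below shows one common way such a combined objective is written: a CLIP-style symmetric InfoNCE term plus a mean absolute error term. It is an assumption-laden illustration rather than the paper's formulation: the temperature of 0.07, the `reg_weight` factor, the choice of an L1 regression term, and every function name are hypothetical, and plain Python lists stand in for encoder outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_row(logits: Sequence[float], target: int) -> float:
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(
    tactile: Sequence[Vector], text: Sequence[Vector], temperature: float = 0.07
) -> float:
    """Pair i of `tactile` matches pair i of `text`; all other pairs are negatives."""
    t = [l2_normalize(v) for v in tactile]
    s = [l2_normalize(v) for v in text]
    n = len(t)
    # logits[i][j]: temperature-scaled cosine similarity of tactile i and text j.
    logits = [
        [sum(a * b for a, b in zip(t[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    tactile_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    text_to_tactile = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (tactile_to_text + text_to_tactile)


def l1_regression_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    # Mean absolute error between predicted and annotated physical attributes.
    return sum(abs(p - y) for p, y in zip(pred, target)) / len(pred)


def total_loss(
    tactile: Sequence[Vector],
    text: Sequence[Vector],
    pred_attrs: Sequence[float],
    true_attrs: Sequence[float],
    reg_weight: float = 1.0,  # hypothetical weighting between the two terms
    temperature: float = 0.07,
) -> float:
    contrastive = symmetric_contrastive_loss(tactile, text, temperature)
    return contrastive + reg_weight * l1_regression_loss(pred_attrs, true_attrs)
```

With matched embeddings such as `[[1, 0], [0, 1]]` on both sides, the contrastive term is close to zero, and it grows sharply when the text rows are swapped. The regression term gives the encoder a direct signal about quantities like force magnitude that a purely contrastive objective only captures indirectly.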