Tactile Modality Fusion for Vision-Language-Action Models

Charlotte Morissette; Amin Abyaneh; Wei-Di Chang; Anas Houssaini; David Meger; Hsiu-Chin Lin; Jonathan Tremblay; Gregory Dudek

Abstract

We propose TacFiLM, a lightweight modality-fusion approach that integrates visual-tactile signals into vision-language-action (VLA) models. Recent advances in VLA models have produced robot policies that are both generalizable and semantically grounded, but these models rely mainly on vision-based perception. Vision alone cannot capture the complex interaction dynamics that occur during contact-rich manipulation, including contact forces, surface friction, compliance, and shear. Recent attempts to integrate tactile signals into VLA models often increase complexity through token concatenation or large-scale pretraining, whereas the heavy computational demands of behavioural models call for more lightweight fusion strategies. To address these challenges, TacFiLM takes a post-training finetuning approach that conditions intermediate visual features on pretrained tactile representations via feature-wise linear modulation (FiLM). Experimental results on insertion tasks demonstrate consistent improvements in success rate, direct insertion performance, completion time, and force stability across both in-distribution and out-of-distribution tasks. Together, these results support our method as an effective approach to integrating tactile signals into VLA models and improving contact-rich manipulation behaviours.
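
As a rough illustration of the FiLM conditioning named in the abstract (a minimal sketch, not the authors' implementation): a tactile embedding is projected to a per-channel scale gamma and shift beta, and every visual token x is modulated as gamma * x + beta. The function names, the plain linear projections, and the toy dimensions below are assumptions made for illustration only; the paper's actual layer placement and parameterization are given in its FiLM-Based Fusion section.

```python
def linear(vec, weights, bias):
    """Affine map; `weights` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(visual_tokens, tactile_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of visual tokens by a tactile embedding.

    visual_tokens: list of tokens, each a list of C channel values.
    tactile_embedding: list of D values (hypothetical pretrained tactile feature).
    w_gamma, w_beta: C x D projection weights; b_gamma, b_beta: length-C biases.
    """
    gamma = linear(tactile_embedding, w_gamma, b_gamma)  # per-channel scale
    beta = linear(tactile_embedding, w_beta, b_beta)     # per-channel shift
    # The same (gamma, beta) pair is applied to every token.
    return [[g * x + b for x, g, b in zip(token, gamma, beta)]
            for token in visual_tokens]


# Toy example: 2 visual tokens with 3 channels, tactile embedding of size 2.
tokens = [[1.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
tactile = [1.0, 2.0]
w_gamma = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_gamma = [0.0, 0.0, 1.0]
w_beta = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
b_beta = [0.5, 0.0, 0.0]

modulated = film(tokens, tactile, w_gamma, b_gamma, w_beta, b_beta)

# With zero projection weights, gamma = 1 and beta = 0, so the layer is an
# identity map and the visual features pass through unchanged.
zeros = [[0.0, 0.0] for _ in range(3)]
identity = film(tokens, tactile, zeros, [1.0] * 3, zeros, [0.0] * 3)
```

Because the modulation only rescales and shifts existing channels, the number of visual tokens is unchanged, which is the sense in which this kind of fusion is lighter than concatenating extra tactile tokens.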

Paper Structure (26 sections, 1 equation, 5 figures, 3 tables)

Introduction
Related Work
Vision-Language-Action (VLA) Models
Tactile Sensing in Robot Learning
Methodology
Policy Architecture
FiLM-Based Fusion
Pretrained Tactile Representations
T3
Sparsh
Training
Experiments
Experiment Setup
Tasks and Data Collection
Evaluation Methods
...and 11 more sections

Figures (5)

Figure 1: TacFiLM overview. We present TacFiLM, a lightweight modality-fusion approach for integrating visual-tactile signals into VLA models. The left panel shows the model inputs, including tactile, visual, and language modalities. The baseline approaches, a vision-only VLA and a tactile-concatenation architecture, are shown in grey. To the right, we show our proposed TacFiLM-augmented VLA, where FiLM layers condition intermediate visual features. The rightmost boxes show model outputs and rollouts.
Figure 2: TacFiLM's modality fusion pipeline. Tactile embeddings are integrated into the vision backbone immediately preceding the multi-head attention layers. The resulting multimodal tokens, combined with language inputs, serve as the basis for action generation within the Llama backbone.
Figure 3: Task definitions. Insertion tasks differ in the shape and clearance of the peg or connector. The shared goal across all tasks is successful insertion of the peg or connector.
Figure 4: Experiment setup (left) and sample rollouts (right). Franka parallel grippers, which the robot uses to hold the pegs, have a DIGIT tactile sensor installed on them. The rollouts demonstrate circle peg and USB connector insertions.
Figure 5: Force and task completion time analysis. Top row: Average force measurements across successfully recovered insertions for the in-distribution (ID) tasks. Bottom row: Task completion times across different methods for ID tasks. The results demonstrate that tactile-aware methods prevent excessive force application while TacFiLM also significantly reduces task completion time.
