Test-Time Adaptation for Tactile-Vision-Language Models

Chuyang Ye; Haoxian Jing; Qinting Jiang; Yixi Lin; Qiang Li; Xing Tang; Jingyan Jiang

TL;DR

This work studies test-time adaptation (TTA) for tactile-vision-language (TVL) models under asynchronous cross-modal distribution shifts. It proposes a reliability-aware framework that estimates per-modality reliability from prediction uncertainty and perturbation-based responses, underscoring the importance of explicit modality-wise reliability modeling for robust TTA.

Abstract

Tactile-vision-language (TVL) models are increasingly deployed in real-world robotic and multimodal perception tasks, where test-time distribution shifts are unavoidable. Existing test-time adaptation (TTA) methods provide filtering in unimodal settings but lack explicit treatment of modality-wise reliability under asynchronous cross-modal shifts, leaving them brittle when some modalities become unreliable. We study TTA for TVL models under such shifts and propose a reliability-aware framework that estimates per-modality reliability from prediction uncertainty and perturbation-based responses. This shared reliability signal is used to (i) filter unreliable test samples, (ii) adaptively fuse tactile, visual, and language features, and (iii) regularize test-time optimization with a reliability-guided objective. On the TAG-C benchmark and additional TVL scenarios, our approach consistently outperforms strong TTA baselines, achieving accuracy gains of up to 49.9% under severe modality corruptions, underscoring the importance of explicit modality-wise reliability modeling for robust test-time adaptation.
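
To make the abstract's reliability signal concrete, the following is a minimal sketch of how a per-sample, per-modality score could be built from prediction uncertainty and a perturbation-based response, and then used for filtering. It is not the paper's formulation: the entropy normalization, the product form of the score, the function names (`reliability`, `filter_reliable`), and the fixed threshold of 0.5 are all assumptions made for illustration. The outline only indicates that the method uses dynamic thresholds, without giving their form here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; used here as the prediction uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reliability(clean_logits, perturbed_logits):
    """Hypothetical reliability score in [0, 1] (assumed form, not the paper's).

    Requires at least two classes. The score is high when the prediction has
    low entropy and the predicted class keeps its confidence after the input
    is perturbed.
    """
    p_clean = softmax(clean_logits)
    p_pert = softmax(perturbed_logits)
    top = max(range(len(p_clean)), key=p_clean.__getitem__)
    # Entropy divided by its maximum log(K), clamped against rounding error.
    uncertainty = min(1.0, entropy(p_clean) / math.log(len(p_clean)))
    # Confidence variation: change in the predicted class's probability.
    variation = abs(p_clean[top] - p_pert[top])
    return (1.0 - uncertainty) * (1.0 - variation)

def filter_reliable(samples, threshold=0.5):
    """Return indices of (clean_logits, perturbed_logits) pairs to keep."""
    return [i for i, (clean, pert) in enumerate(samples)
            if reliability(clean, pert) >= threshold]

# Sample 0 is confident and stable; sample 1 is near-uniform and unstable.
batch = [
    ([5.0, 0.0, 0.0], [4.8, 0.0, 0.0]),
    ([0.2, 0.1, 0.0], [0.0, 0.3, 0.1]),
]
kept = filter_reliable(batch)
```

In this sketch the same score could also weight each modality during fusion or scale a per-sample loss, which mirrors the three uses listed in the abstract, although the exact weighting scheme is not given in this summary.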

Paper Structure (22 sections, 14 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Formulation and Key Observations
Problem definition
Key observations
High-reliable samples are more beneficial for model adaptation.
The inference performance of multi-modal fusion is more robust.
Methodology
Dynamic Reliable Sample Filtering
Perturbation-based Reliability Indicators.
Dynamic Thresholds and Sample Updating.
Dynamic Modality Fusion
Reliability-Aware Loss
Experiments
Experiment Settings
...and 7 more sections

Figures (7)

Figure 1: Illustration of asynchronous distribution shifts in in-the-wild multi-modal settings. (a) Asynchronous distribution shifts in the vision and tactile modalities, where one modality is corrupted (✗) while the other remains reliable (✓), simulating unpredictable real-world conditions such as sensor degradation during a robot's rescue mission in a disaster zone. (b) Performance degradation under these shifts, comparing the CLIP-based zero-shot classification accuracy (%) of baseline methods (Source, TENT, SAR, READ) against our RobustTouch framework, which adapts better and maintains robust perception under modality-specific uncertainty.
Figure 2: Experiments on data reliability. (a) High-reliable versus low-reliable samples in multi-modal data under distribution shifts, visualized with Grad-CAM. High-reliable samples capture the relevant features and support robust model adaptation, while low-reliable samples introduce errors by emphasizing irrelevant regions or missing details. (b) Impact of data quality on material classification accuracy using TENT on the TAG-C dataset. The bar chart compares classification accuracy (%) for the Touch-Text and Image-Text modalities when adapting with all available data versus only clean data under distribution shifts. High-reliable (clean) samples significantly boost performance, particularly for tactile data, while the noisy samples included in the full set degrade it.
Figure 3: Comparison of accuracy in material classification. (a) Accuracy across different modalities and fusion methods. Fused Emb. refers to averaging the embeddings of the two modalities, while Fused Logits refers to averaging the Tactile-Language and Vision-Language logits. The accuracy gain shown on the bars is the difference between the best fusion accuracy and the worst single-modality accuracy. (b) Accuracy comparison for material classification under continual test-time adaptation scenarios. The Source method evaluates the data without any adaptation.
Figure 4: Overview of the RobustTouch algorithm. The algorithm consists of three key modules: (1) Dynamic Reliable Sample Filtering, which identifies reliable samples based on perturbation-based reliability indicators; (2) Dynamic Modality Fusion, which adaptively combines vision and touch embeddings using a lightweight fusion network; (3) Reliability-Aware Loss Design, which employs modality-specific losses to drive adaptation. The model updates are focused on attention QKV cache and adapters, enabling efficient test-time adaptation without requiring labels.
Figure 5: (a) Performance Trend over Time under the dynamic wild setting on TAG-C benchmark (corrupted visual modality). (b) Sensitivity test of batch size in tactile modality corruption on TAG-C benchmark (corrupted tactile modality). (c) Sensitivity test of different $\lambda$ on TAG-C benchmark (corrupted tactile modality). (d) Sensitivity test of different $\alpha$ on TAG-C benchmark (corrupted tactile modality).
...and 2 more figures
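
The two fusion baselines named in the Figure 3 caption, Fused Emb. and Fused Logits, can be sketched as follows. This is an illustrative sketch under stated assumptions: logits are taken to be plain cosine similarities against class text embeddings (CLIP-style zero-shot scoring, with no temperature scaling), embeddings are assumed to be L2-normalized before averaging, and all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def logits(embedding, text_embeddings):
    """Cosine similarity between one embedding and each class text embedding."""
    e = normalize(embedding)
    return [sum(a * b for a, b in zip(e, normalize(t))) for t in text_embeddings]

def fused_embedding_logits(tactile, vision, text_embeddings):
    """Fused Emb.: average the two modality embeddings, then score once."""
    fused = [(a + b) / 2.0 for a, b in zip(normalize(tactile), normalize(vision))]
    return logits(fused, text_embeddings)

def fused_logits(tactile, vision, text_embeddings):
    """Fused Logits: score each modality separately, then average the logits."""
    lt = logits(tactile, text_embeddings)
    lv = logits(vision, text_embeddings)
    return [(a + b) / 2.0 for a, b in zip(lt, lv)]

# Toy 2-D example: the third class lies between the two modality embeddings.
tactile, vision = [1.0, 0.0], [0.0, 1.0]
classes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
by_embedding = fused_embedding_logits(tactile, vision, classes)
by_logits = fused_logits(tactile, vision, classes)
```

Both variants weight the modalities equally, so a corrupted modality contributes as much as a clean one. A reliability-weighted fusion, as the Dynamic Modality Fusion module in Figure 4 suggests, would replace the fixed 0.5 weights with learned or estimated ones.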

Theorems & Definitions (2)

Definition 1: Prediction Uncertainty
Definition 2: Confidence Variation
