TacMamba: A Tactile History Compression Adapter Bridging Fast Reflexes and Slow VLA Reasoning

Zhenan Wang; Yanzhe Wang; Meixuan Ren; Peng Li; Yang Liu; Yifei Nie; Limin Long; Yun Ye; Xiaofeng Wang; Zhen Zhu; Huixu Dong

TL;DR

TacMamba is a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. It uses temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity.

Abstract

In visually ambiguous manipulation tasks, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with $\mathcal{O}(1)$ inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training; and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves 100% success rates, significantly outperforming the visual-only $\pi_{0.5}$ baseline, while strictly satisfying hard real-time constraints.
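
The constant-cost claim for the Tactile History Compressor can be illustrated with a toy recurrent update. The following is a minimal sketch, assuming a diagonal selective state-space cell with a zero-order-hold decay and hand-picked weights; the class name `SelectiveSSMCell`, its parameters, and the way the step size, input gate, and readout depend on the input are all hypothetical and are not taken from the paper's trained model. It shows only that each new force sample updates a fixed-size state, so per-step work and memory do not grow with history length.

```python
import math


class SelectiveSSMCell:
    """Toy diagonal selective state-space cell for a 1D force stream.

    Illustrative only: the weights and the input dependence of the
    step size, B_t and C_t are assumptions, not the paper's model.
    """

    def __init__(self, state_dim=4):
        # Negative diagonal entries give a stable, decaying system.
        self.a = [-(i + 1.0) for i in range(state_dim)]
        # Hypothetical fixed weights that make the parameters input-dependent.
        self.w_delta = 0.5
        self.w_b = [1.0 / (i + 1.0) for i in range(state_dim)]
        self.w_c = [1.0 for _ in range(state_dim)]
        # Fixed-size compressed history; never grows with sequence length.
        self.h = [0.0] * state_dim

    def step(self, x):
        """Consume one force sample and return one scalar output."""
        # Softplus keeps the input-dependent step size positive.
        delta = math.log1p(math.exp(self.w_delta * x))
        y = 0.0
        for i, a_i in enumerate(self.a):
            a_bar = math.exp(delta * a_i)        # discretized decay in (0, 1)
            b_bar = delta * self.w_b[i] * x      # input-dependent B_t, scaled by delta
            self.h[i] = a_bar * self.h[i] + b_bar * x
            y += self.w_c[i] * x * self.h[i]     # input-dependent C_t readout
        return y


if __name__ == "__main__":
    cell = SelectiveSSMCell()
    for force in [0.0, 0.2, 1.0, 0.1, 0.0]:
        cell.step(force)
    print(len(cell.h), cell.h)
```

Because the loop in `step` runs over the state dimension only, streaming ten samples or ten thousand leaves the stored state the same size, which is the property that makes a recurrent compressor attractive for a high-frequency loop.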

Paper Structure (22 sections, 4 equations, 7 figures, 1 table)

INTRODUCTION
RELATED WORK
Temporal model in VLA
Tactile-Augmented Robotic Manipulation
Efficient Long-Sequence Modeling
HARDWARE SYSTEM
METHODOLOGY
Tactile Encoder via Mamba Architecture
Continuous-Discrete Modeling
Capturing Hybrid Dynamics via Selectivity
Efficient Hierarchical Architecture with $\mathcal{O}(1)$ Inference
Training Strategy
Stage 1: Self-Supervised Pre-training via Temporal Discrimination
Stage 2: Tactile-Guided VLA Tuning with Phase Sampling
EXPERIMENTS
...and 7 more sections

Figures (7)

Figure 1: The TacMamba System Architecture. The framework bridges the spatiotemporal discrepancy between high-speed reflexes and low-frequency reasoning. Left (System 1): The tactile encoder processes 1D force streams at 100Hz using Mamba models, recursively updating the hidden state $h_t$ in real time. Right (System 2): This compressed hidden state $h_t$ is projected and asynchronously injected as a soft prompt into the low-frequency ($\sim$1Hz) Vision-Language-Action (VLA) planner.
Figure 2: System Overview. (a) Modular morphology-based tactile fingertip design, where an integrated compliant contact body supports both fingertip and fingerpad interactions and mechanically projects distributed contacts onto a single-axis force sensor; lateral side panels protect the internal structure, and a reconfigurable dual-clamp interface enables attachment to generic parallel grippers. (b) FEM-based characterization of force projection, showing peak contact pressure versus applied load for fingertip and fingerpad contacts, together with (c) representative visualizations of total deformation, equivalent (von Mises) stress, and contact pressure.
Figure 3: The TacMamba Network Architecture. Top: The core TacMamba backbone processes continuous tactile streams via a hierarchical Selective SSM, utilizing RevIN and Channel Independence for feature extraction. Bottom-Left: An expanded view of the Mamba block mechanism, illustrating how input-dependent parameters ($\Delta t, B_t, C_t$) model hybrid dynamics. Bottom-Right: The auxiliary discriminator network, employed exclusively during the training phase to facilitate robust feature learning and enforce temporal causality.
Figure 4: Efficiency Analysis. Inference latency and memory growth.
Figure 6: Global Task Success Rate. Comparison of full-task completion rates over training steps. TacMamba (Red) achieves rapid convergence and high robustness, while $\pi_{0.5}$ (Blue) suffers from catastrophic failure in the button task due to static frame overfitting.
...and 2 more figures
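
The two-rate structure described in the Figure 1 caption can be sketched as a single loop in which a compact state is updated on every tactile sample and read only occasionally by the slow planner. This is a minimal sketch under stated assumptions: the function name `run_dual_rate` and its parameters are hypothetical, and a leaky integrator stands in for the learned compressor purely to keep the example short. The default rates mirror the 100Hz and roughly 1Hz figures in the caption.

```python
def run_dual_rate(force_stream, fast_hz=100, slow_hz=1, decay=0.99):
    """Update a compact state per sample; snapshot it at the slow rate.

    The leaky integrator below is a stand-in for a learned history
    compressor and is an assumption of this sketch.
    """
    ticks_per_plan = fast_hz // slow_hz
    state = 0.0
    snapshots = []  # what the slow planner would receive
    for tick, force in enumerate(force_stream, start=1):
        # Fast loop: constant-cost update on every tactile sample.
        state = decay * state + (1.0 - decay) * force
        # Slow loop: the planner sees only the latest compressed state.
        if tick % ticks_per_plan == 0:
            snapshots.append(state)
    return snapshots


if __name__ == "__main__":
    # Three seconds of a constant unit force sampled at 100Hz.
    print(run_dual_rate([1.0] * 300))
```

The planner never touches the raw stream in this arrangement; everything it learns about the intervening samples has to be carried by the state it reads at each slow tick.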
