
TacMamba: A Tactile History Compression Adapter Bridging Fast Reflexes and Slow VLA Reasoning

Zhenan Wang, Yanzhe Wang, Meixuan Ren, Peng Li, Yang Liu, Yifei Nie, Limin Long, Yun Ye, Xiaofeng Wang, Zhen Zhu, Huixu Dong

TL;DR

TacMamba is a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. It uses temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity.

Abstract

In visually ambiguous manipulation tasks, such as detecting a button click, tactile feedback is often the sole source of ground truth. However, fusing tactile data poses a significant challenge due to a spatiotemporal mismatch: tactile perception requires high-frequency processing with long-horizon memory (System 1), whereas visual policies operate at low control frequencies (System 2). Existing architectures struggle to bridge this gap: Transformers are computationally prohibitive for high-frequency loops (>100 Hz), while LSTMs suffer from forgetting over extended interaction histories. In this paper, we introduce TacMamba, a hierarchical architecture that aligns high-bandwidth tactile reflexes with low-frequency visual planning. Our approach comprises three core contributions: (1) a custom high-frequency tactile interface designed for flexible integration; (2) a Mamba-based Tactile History Compressor that encodes continuous force history into a compact state with O(1) inference latency (0.45 ms), enabling plug-and-play fusion with VLA models without joint pre-training; and (3) a Tactile-Guided Dual-Stage Training strategy that leverages temporal discrimination for self-supervised representation learning and phase-uniform sampling to mitigate data sparsity. Experiments on discrete counting and implicit state switching demonstrate that TacMamba achieves 100% success rates, significantly outperforming the visual-only $\pi_{0.5}$ baseline, while strictly satisfying hard real-time constraints.
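
To make the O(1) claim concrete, the following is a minimal sketch of a recurrent selective state-space update for a single force channel. It is not the authors' implementation: the class name `SelectiveSSMCell`, the scalar parameterization of $\Delta_t$, $B_t$, $C_t$, and all weights are illustrative assumptions. The point it shows is that each new sample touches only a fixed-size state, so per-step cost does not grow with the length of the history.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size delta positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


class SelectiveSSMCell:
    """Toy diagonal selective SSM for one scalar input channel.

    All parameter names are hypothetical; a real Mamba block uses learned
    projections over many channels plus gating and convolution.
    """

    def __init__(self, a_diag, w_b, w_c, w_delta=0.0, b_delta=0.0):
        assert all(a < 0 for a in a_diag), "negative A keeps the state stable"
        self.a = list(a_diag)
        self.w_b = list(w_b)
        self.w_c = list(w_c)
        self.w_delta = w_delta
        self.b_delta = b_delta
        self.h = [0.0] * len(self.a)  # fixed-size compressed history

    def step(self, x):
        # Input-dependent step size, input matrix, and output matrix.
        delta = softplus(self.w_delta * x + self.b_delta)
        y = 0.0
        for i, a in enumerate(self.a):
            a_bar = math.exp(delta * a)        # discretized state decay
            b_bar = delta * self.w_b[i] * x    # discretized B_t = w_b * x
            self.h[i] = a_bar * self.h[i] + b_bar * x
            y += (self.w_c[i] * x) * self.h[i]  # C_t = w_c * x
        return y


if __name__ == "__main__":
    cell = SelectiveSSMCell(a_diag=[-1.0, -2.0], w_b=[1.0, 1.0], w_c=[1.0, 1.0])
    for force in [0.0, 0.2, 1.0, 0.9, 0.0]:
        cell.step(force)
    print(len(cell.h), cell.h)  # state size stays at 2 regardless of history
```

The work in `step` depends only on the state dimension, which is the property that lets a recurrent compressor of this kind run inside a high-frequency loop.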

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: The TacMamba System Architecture. The framework bridges the spatiotemporal discrepancy between high-speed reflexes and low-frequency reasoning. Left (System 1): The tactile encoder processes 1D force streams at 100 Hz using Mamba models, recursively updating the hidden state $h_t$ in real time. Right (System 2): The compressed hidden state $h_t$ is projected and asynchronously injected as a soft prompt into the low-frequency ($\sim$1 Hz) Vision-Language-Action (VLA) planner.
  • Figure 2: System Overview. (a) Modular morphology-based tactile fingertip design, where an integrated compliant contact body supports both fingertip and fingerpad interactions and mechanically projects distributed contacts onto a single-axis force sensor; lateral side panels protect the internal structure, and a reconfigurable dual-clamp interface enables attachment to generic parallel grippers. (b) FEM-based characterization of force projection, showing peak contact pressure versus applied load for fingertip and fingerpad contacts, together with (c) representative visualizations of total deformation, equivalent (von Mises) stress, and contact pressure.
  • Figure 3: The TacMamba Network Architecture. Top: The core TacMamba backbone processes continuous tactile streams via a hierarchical Selective SSM, utilizing RevIN and Channel Independence for feature extraction. Bottom-Left: An expanded view of the Mamba block mechanism, illustrating how input-dependent parameters ($\Delta t, B_t, C_t$) model hybrid dynamics. Bottom-Right: The auxiliary discriminator network, employed exclusively during the training phase to facilitate robust feature learning and enforce temporal causality.
  • Figure 4: Efficiency Analysis. Inference latency and memory growth.
  • Figure 6: Global Task Success Rate. Comparison of full-task completion rates over training steps. TacMamba (Red) achieves rapid convergence and high robustness, while $\pi_{0.5}$ (Blue) suffers from catastrophic failure in the button task due to static frame overfitting.
  • ...and 2 more figures
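
The dual-rate flow described in the Figure 1 caption can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's pipeline: it runs in a single thread rather than truly asynchronously, `leaky_update` is a placeholder leaky integrator standing in for the Mamba encoder, and `project`, `run_dual_rate`, and the weight matrix are hypothetical names and values. It only shows the rate relationship: the state is updated on every 100 Hz tactile sample, while a projected snapshot is handed to the planner once per 1 Hz planning period.

```python
def project(h, weight):
    """Linearly project the compact state into a prompt-token vector."""
    return [sum(w * v for w, v in zip(row, h)) for row in weight]


def leaky_update(h, force, decay=0.99):
    # Placeholder for the tactile encoder: a leaky integrator, NOT a Mamba block.
    return [decay * v + (1.0 - decay) * force for v in h]


def run_dual_rate(force_stream, update_state, h0, weight, fast_hz=100, slow_hz=1):
    """Simulate fast state updates with slow, snapshot-based prompt injection."""
    ticks_per_plan = fast_hz // slow_hz
    h = list(h0)
    prompts = []
    for tick, force in enumerate(force_stream, start=1):
        h = update_state(h, force)          # System 1: every tactile sample
        if tick % ticks_per_plan == 0:      # System 2: once per planning period
            prompts.append(project(h, weight))
    return prompts


if __name__ == "__main__":
    stream = [1.0] * 300                    # 3 seconds of constant force at 100 Hz
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    prompts = run_dual_rate(stream, leaky_update, [0.0, 0.0], weight)
    print(len(prompts))                     # one prompt vector per planning step
```

Because the planner only ever reads the latest fixed-size state, the slow loop never needs to replay the raw force history between planning steps.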