TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Chenghao Liu, Jiachen Zhang, Chengxuan Li, Zhimu Zhou, Shixin Wu, Songfang Huang, Huiling Duan

TL;DR

This work tackles the temporal myopia in Vision-Language-Action models by introducing Temporal Token Fusion (TTF), a training-free framework that intelligently fuses historical and current visual tokens through dual-dimension detection (grayscale pixel differences and attention-based semantic relevance) and a hard fusion mechanism complemented by a keyframe strategy. The method is model-agnostic and validated across LIBERO, SimplerEnv, and real-robot tasks, showing consistent improvements (e.g., +4.0 percentage points on LIBERO, +4.8% relative on SimplerEnv, +8.7% relative on real robots) with minimal runtime overhead. A notable finding is that selective Query matrix reuse in attention can enhance performance, suggesting future avenues for direct KQV reuse to accelerate inference. The work contributes a principled approach to leveraging temporal context in VLA systems and opens practical directions for more efficient attention-based inference in robotics.
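
The gains quoted above mix two units: percentage points for LIBERO and relative improvement for SimplerEnv and the real-robot tasks. As a quick check, using the LIBERO averages reported in the abstract (72.4% with TTF vs 68.4% baseline), the same gain can be written both ways:

```latex
\Delta_{\text{abs}} = 72.4 - 68.4 = 4.0 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{72.4 - 68.4}{68.4} \approx 5.8\%
```

The relative form is computed here only for comparison; it is not a figure stated in the paper summary.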

Abstract

Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent improvements: 4.0 percentage points average on LIBERO (72.4% vs 68.4% baseline), cross-environment validation on SimplerEnv (4.8% relative improvement), and 8.7% relative improvement on real robot tasks. Our approach proves model-agnostic, working across OpenVLA and VLA-Cache architectures. Notably, TTF reveals that selective Query matrix reuse in attention mechanisms enhances rather than compromises performance, suggesting promising directions for direct KQV matrix reuse strategies that achieve computational acceleration while improving task success rates.
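
The dual-dimension detection and hard fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the thresholds (`pixel_thresh`, `top_k`), the luminance weights, and the choice to combine the two detectors with a logical OR are all assumptions made for illustration. The abstract states only that pixel differences and attention relevance are combined, and that fusion is "hard", meaning each patch takes either the current or the previous token.

```python
import numpy as np

def to_grayscale(frame):
    """Convert an HxWx3 RGB frame to HxW luminance (standard weights, assumed)."""
    return frame.astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def pixel_change_mask(prev_frame, curr_frame, patch, pixel_thresh):
    """Flag patches whose mean absolute grayscale difference exceeds pixel_thresh."""
    diff = np.abs(to_grayscale(curr_frame) - to_grayscale(prev_frame))
    gh, gw = diff.shape[0] // patch, diff.shape[1] // patch
    # Split the image into a (gh x gw) grid of patches and average each patch.
    grid = diff[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    return grid.mean(axis=(1, 3)).reshape(-1) > pixel_thresh

def attention_mask(attn_scores, top_k):
    """Flag the top_k patches by attention score (one score per patch, assumed)."""
    mask = np.zeros(attn_scores.shape[0], dtype=bool)
    if top_k > 0:
        mask[np.argsort(attn_scores)[-top_k:]] = True
    return mask

def hard_fuse(prev_tokens, curr_tokens, important):
    """Important patches keep current tokens; all others reuse previous tokens."""
    return np.where(important[:, None], curr_tokens, prev_tokens)

# Toy example: a 4x4 frame split into four 2x2 patches, token dimension 3.
prev_frame = np.zeros((4, 4, 3))
curr_frame = prev_frame.copy()
curr_frame[0:2, 0:2, :] = 255.0           # only patch 0 changes visually
attn = np.array([0.1, 0.2, 0.9, 0.3])     # patch 2 is most attended

# Assumption: a patch is "important" if either detector flags it.
important = pixel_change_mask(prev_frame, curr_frame, 2, 10.0) | attention_mask(attn, 1)
fused = hard_fuse(np.zeros((4, 3)), np.ones((4, 3)), important)
```

In this toy case patches 0 and 2 are refreshed from the current frame, while patches 1 and 3 reuse the previous tokens, which is the behavior the hard fusion strategy calls for.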

Paper Structure

This paper contains 23 sections, 9 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overall Framework of Temporal Token Fusion for VLA Models. The framework illustrates the end-to-end process, where the Vision Encoder extracts tokens from current (Observation$_t$) and previous (Observation$_{t-1}$) frames. These are then processed by the Patch Selection module and TTF module for patch selection and token fusion. The fused tokens are subsequently fed into the LLM Backbone, combined with language instruction, to generate 7-DoF robotic actions via the Action Detokenizer.
  • Figure 2: The Details of Patch Selection and Temporal Token Fusion. The process includes (a) Grayscale Pixel Difference Detection and Attention-Based Semantic Relevance Detection for identifying important patches, and (b) fusion of selected Tokens$_t$ with Tokens$_{t-1}$ into Fused Tokens$_t$, where important patches use current frame tokens and others use previous frame tokens.
  • Figure 3: Temporal progression analysis illustrating a failure-to-success transition for the task instruction "pick up the butter and place it in the basket." Eight key phases showing OpenVLA baseline (failed) vs. OpenVLA + TTF (successful), demonstrating the critical role of temporal consistency in successful manipulation.
  • Figure 4: Real robot manipulation tasks used for physical validation of TTF: (a) single-object pick-and-place, (b) multi-object sequential manipulation, and (c) contact-rich drawer closing.
  • Figure 5: Keyframe interval analysis across Object and Long task suites. Top: Performance vs. Keyframe Interval showing error accumulation beyond K=30. Bottom: Fusion Rates vs. Keyframe Interval revealing the efficiency-performance trade-off. The analysis demonstrates three distinct regimes: stable performance (K$\leq$15), degradation onset (K=20--30), and error accumulation (K$\geq$30).
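
The keyframe strategy analysed in Figure 5 can be sketched as a loop that forces a full refresh every K steps. This is a hedged illustration, not the paper's algorithm. The function name `run_with_keyframes`, the use of the previous step's fused tokens as history, and the definition of fusion rate as the fraction of patch tokens reused from history are assumptions. The captions state only that keyframes limit error accumulation and that fusion rate varies with the keyframe interval.

```python
import numpy as np

def run_with_keyframes(token_stream, important_stream, keyframe_interval):
    """Apply hard fusion over time with a full refresh every keyframe_interval steps.

    Returns the fused tokens per step and the fusion rate, defined here
    (as an assumption) as the fraction of patch tokens reused from history.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be at least 1")
    fused_history, reused, total, prev = [], 0, 0, None
    for t, (curr, important) in enumerate(zip(token_stream, important_stream)):
        if prev is None or t % keyframe_interval == 0:
            fused = curr.copy()                    # keyframe: use current tokens only
        else:
            fused = np.where(important[:, None], curr, prev)
            reused += int((~important).sum())      # patches carried over from history
        total += important.shape[0]
        fused_history.append(fused)
        prev = fused                               # assumed: history is the fused output
    return fused_history, reused / total

# Toy stream: 4 steps, 2 patches, token value equals the step index.
tokens = [np.full((2, 1), float(t)) for t in range(4)]
masks = [np.array([True, False])] * 4              # patch 1 is never flagged

short_k, short_rate = run_with_keyframes(tokens, masks, keyframe_interval=2)
long_k, long_rate = run_with_keyframes(tokens, masks, keyframe_interval=100)
```

With the long interval, patch 1 still holds the step-0 token at step 3, while the short interval refreshes it at step 2. This shows the trade-off in miniature: a larger K reuses more tokens but lets stale content persist longer.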