TempoFit: Plug-and-Play Layer-Wise Temporal KV Memory for Long-Horizon Vision-Language-Action Manipulation

Jun Sun, Boyu Yang, Jiahao Zhang, Ning Ma, Chencheng Wu, Siqing Zhang, Yiou Huang, Qiufeng Wang, Shan Liang, Yaran Chen

Abstract

Pretrained Vision-Language-Action (VLA) policies have achieved strong single-step manipulation, but their inference remains largely memoryless, which is brittle in non-Markovian long-horizon settings with occlusion, state aliasing, and subtle post-action changes. Prior approaches inject history either by stacking frames, which scales visual tokens and latency while adding near-duplicate pixels, or by learning additional temporal interfaces that require (re-)training and may break the original single-frame inference graph. We present TempoFit, a training-free temporal retrofit that upgrades frozen VLAs through state-level memory. Our key insight is that prefix attention K/V already form a model-native, content-addressable runtime state; reusing them across timesteps introduces history without new tokens or trainable modules. TempoFit stores layer-wise FIFO prefix K/V at selected intermediate layers, performs parameter-free K-to-K retrieval with Frame-Gap Temporal Bias (FGTB), a fixed recency bias inspired by positional biases in NLP, to keep decisions present-dominant, and injects the retrieved context via pre-attention residual loading with norm-preserving rescaling to avoid distribution shift under frozen weights. On LIBERO-LONG, TempoFit improves strong pretrained backbones by up to +4.0% average success rate while maintaining near-real-time latency, and it transfers consistently to CALVIN and real-robot long-horizon tasks.

Paper Structure (17 sections, 7 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: TempoFit overview. At each timestep, TempoFit caches prefix $K/V$ at selected intermediate layers, retrieves relevant history via K-to-K matching with FGTB, and injects the retrieved context through pre-attention residual loading (optionally with norm-preserving rescaling), enabling training-free temporal retrofitting without expanding input context length.
  • Figure 2: TempoFit pipeline. (a) In the Layer-Wise FIFO KV Cache (see Sec. \ref{sec:write}), TempoFit caches prefix $K/V$ states at selected intermediate layers, preserving historical context without expanding the input token sequence. (b) In K-to-K Retrieval with FGTB (see Sec. \ref{sec:kk_retrieval} and \ref{sec:fgtb}), the module uses current keys to retrieve relevant historical features via address-space matching, applying a fixed Frame-Gap Temporal Bias (FGTB) to down-weight stale history and minimize interference. (c) Finally, via Norm-Preserving Residual Loading (see Sec. \ref{sec:injection}), the retrieved history is injected into the current state through a rescaled residual update, enabling the frozen backbone to generate temporally consistent actions without parameter updates.
  • Figure 3: Real-world evaluation on Realman RM-65B. Left: hardware and multi-view sensing setup. Right: quantitative success rates and/or qualitative rollouts on three long-horizon manipulation tasks.
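
The three stages named in the Figure 2 caption (FIFO caching, K-to-K retrieval with FGTB, and norm-preserving residual loading) can be illustrated with a minimal, single-layer sketch in plain Python. This is not the authors' implementation: the class name `TemporalKVMemory`, the helper `step`, the linear form of the recency bias (`-gap_penalty * gap`), the mixing weight `alpha`, and the choice to cache the un-injected K/V are all assumptions made for illustration, since the excerpt here does not give the exact equations.

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TemporalKVMemory:
    """Hypothetical per-layer FIFO of past prefix K/V, one entry per timestep."""

    def __init__(self, capacity=4, gap_penalty=0.5, alpha=0.3):
        self.frames = deque(maxlen=capacity)  # oldest frame is evicted first
        self.gap_penalty = gap_penalty  # assumed: linear penalty per frame of gap
        self.alpha = alpha              # assumed: residual mixing weight

    def write(self, keys, values):
        self.frames.append((keys, values))

    def retrieve(self, cur_keys):
        """K-to-K retrieval: current keys score cached keys, plus a recency bias."""
        if not self.frames:
            return None
        mem_k, mem_v, bias = [], [], []
        n = len(self.frames)
        for idx, (ks, vs) in enumerate(self.frames):
            gap = n - idx  # 1 for the most recent cached frame
            for k, v in zip(ks, vs):
                mem_k.append(k)
                mem_v.append(v)
                bias.append(-self.gap_penalty * gap)  # fixed, parameter-free
        scale = 1.0 / math.sqrt(len(cur_keys[0]))
        out_k, out_v = [], []
        for q in cur_keys:
            w = softmax([scale * dot(q, k) + b for k, b in zip(mem_k, bias)])
            out_k.append(self._mix(w, mem_k))
            out_v.append(self._mix(w, mem_v))
        return out_k, out_v

    @staticmethod
    def _mix(weights, rows):
        width = len(rows[0])
        return [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(width)]

    def inject(self, current, retrieved):
        """Residual update, then rescale each token back to its original norm."""
        out = []
        for c, r in zip(current, retrieved):
            mixed = [ci + self.alpha * ri for ci, ri in zip(c, r)]
            s = norm(c) / (norm(mixed) + 1e-8)
            out.append([s * m for m in mixed])
        return out


def step(memory, keys, values):
    """One timestep at one layer: retrieve, inject, then cache the raw K/V."""
    retrieved = memory.retrieve(keys)
    if retrieved is None:  # no history yet: plain single-frame inference
        new_k, new_v = keys, values
    else:
        new_k = memory.inject(keys, retrieved[0])
        new_v = memory.inject(values, retrieved[1])
    memory.write(keys, values)  # assumed: store the un-injected state
    return new_k, new_v
```

In this sketch the token count seen by the attention layer never grows, since history only perturbs the existing K/V rows, and the per-token rescaling keeps their norms at the values the frozen weights would otherwise produce. A real retrofit would operate on batched tensors per attention head rather than Python lists.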