DepthCache: Depth-Guided Training-Free Visual Token Merging for Vision-Language-Action Model Inference

Yuquan Li, Lianjie Ma, Han Ding, Lijun Zhu

TL;DR

DepthCache is a training-free framework that uses depth as a structural prior for visual token compression: it partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background.

Abstract

Vision-Language-Action (VLA) models enable generalist robotic manipulation but suffer from high inference latency. This bottleneck stems from the massive number of visual tokens processed by large language backbones. Existing methods either prune or merge tokens uniformly, degrading the spatial reasoning essential for robotic control. We present DepthCache, a training-free framework that leverages depth as a structural prior for visual token compression. It partitions observations into depth-based regions and applies spatially differentiated merge ratios, preserving the near-field workspace while compressing the distant background. To exploit temporal redundancy, DepthCache distributes the merging process across consecutive frames, ensuring consistent representations while reducing per-step computation. A motion-adaptive pipeline further optimizes auxiliary view compression based on end-effector dynamics. The framework requires no model modification, generalizing across diverse VLA architectures. On the LIBERO benchmark, DepthCache achieves up to 1.28x inference speedup with less than 1% average success rate degradation across three VLA models ($\pi_{0.5}$, OpenVLA, GR00T), whereas pruning and merging baselines incur 4 to 24% degradation at comparable compression. Real-world experiments on a physical manipulator demonstrate that DepthCache enables faster task throughput and more responsive closed-loop control in latency-sensitive scenarios.
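
To make the idea of spatially differentiated merge ratios concrete, here is a minimal sketch, not the authors' implementation. It assumes each visual token carries one scalar depth, that regions are formed by splitting tokens into equal-sized groups by depth rank, and that the merge ratio grows linearly from 0 for the nearest region to a cap `r_max` for the farthest. The function names, the equal-population split, and the linear ramp are all assumptions made for illustration.

```python
def region_merge_ratios(token_depths, n_regions=3, r_max=0.7):
    """Assign each token to a depth region and give each region a merge ratio.

    Assumptions of this sketch (not taken from the paper): regions hold equal
    numbers of tokens ranked by depth, and ratios ramp linearly from 0 to r_max.
    """
    n = len(token_depths)
    # Rank tokens from nearest to farthest.
    order = sorted(range(n), key=lambda i: token_depths[i])
    region_of = [0] * n
    for rank, idx in enumerate(order):
        region_of[idx] = rank * n_regions // n
    if n_regions == 1:
        ratios = [0.0]
    else:
        # Region 0 (near field) is left untouched; the farthest gets r_max.
        ratios = [r_max * k / (n_regions - 1) for k in range(n_regions)]
    return region_of, ratios


def tokens_kept(region_of, ratios):
    """Count tokens remaining if each region merges away floor(count * ratio)."""
    counts = [0] * len(ratios)
    for r in region_of:
        counts[r] += 1
    return sum(c - int(c * ratio) for c, ratio in zip(counts, ratios))


# Hypothetical per-token depths in metres.
depths = [0.3, 2.5, 0.4, 1.1, 3.0, 1.2]
region_of, ratios = region_merge_ratios(depths)
kept = tokens_kept(region_of, ratios)
```

In this toy case the two nearest tokens stay intact and only the farthest region loses a token, which mirrors the stated goal of preserving the near-field workspace while compressing the background.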

Paper Structure (21 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: DepthCache enables training-free inference acceleration for VLA models with negligible performance loss. Top: speedup and success rate on LIBERO across three VLA architectures. Bottom: directly transplanting token merging (ToSA, left) causes hesitation during grasping; DepthCache (right) maintains fluid execution with reduced latency.
  • Figure 2: Overview of DepthCache. The primary view pipeline (top) forms a cyclic process: initialization computes dual protection sets, depth-based partitioning assigns differentiated merge ratios, and progressive merging reduces tokens across frames until scene change triggers re-initialization. The auxiliary view pipeline (bottom) employs a two-state machine to gate wrist camera compression by end-effector motion.
  • Figure 3: Real-world core task execution with DepthCache-augmented $\pi_{0.5}$. Perturbation recovery is shown in Fig. 1.
  • Figure 4: Sequential multi-object sorting comparison. Top: baseline $\pi_{0.5}$. Bottom: $\pi_{0.5}$ + DepthCache. The time gap accumulates across successive pick-and-place cycles.
  • Figure 5: Parameter sensitivity analysis on $\pi_{0.5}$ (LIBERO). (a) Effect of maximum merge ratio $r_{\max}$ on success rate and speedup. (b) Effect of progressive merge step ratio $\eta$ on success rate (log-scale $x$-axis). Default values ($r_{\max}=0.7$, $\eta=0.2$) are highlighted.
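
The captions for Figures 2 and 5 describe progressive merging that reduces tokens across frames until a scene change triggers re-initialization, governed by $r_{\max}$ and $\eta$. The sketch below shows one possible reading of that cycle, under a loudly stated assumption: the per-frame merge ratio grows by $\eta \cdot r_{\max}$ each step, saturates at $r_{\max}$, and resets to zero on a scene change. The exact role of $\eta$ is not given in this summary, and `progressive_ratio_schedule` and its `scene_changed` input are hypothetical names.

```python
def progressive_ratio_schedule(scene_changed, r_max=0.7, eta=0.2):
    """Return the merge ratio applied at each frame.

    scene_changed: iterable of booleans, True when a frame triggers
    re-initialization. The update rule (add eta * r_max per frame, cap at
    r_max, reset on scene change) is an assumption of this sketch.
    """
    ratio = 0.0
    schedule = []
    for changed in scene_changed:
        if changed:
            ratio = 0.0  # re-initialize: start merging from scratch
        else:
            ratio = min(r_max, ratio + eta * r_max)
        schedule.append(ratio)
    return schedule


# Six steady frames, then a scene change, then one more steady frame.
schedule = progressive_ratio_schedule([False] * 6 + [True, False])
```

With the highlighted defaults, the ratio under this rule climbs in steps of 0.14, reaches the 0.7 cap within about five frames, and drops back to zero at the scene change.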