
OxyGen: Unified KV Cache Management for Vision-Language-Action Models under Multi-Task Parallelism

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

Abstract

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment due to redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference paradigm that treats KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this paradigm for $\pi_{0.5}$, the most popular MoT VLA, and evaluate under representative robotic configurations. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without action quality degradation.
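
As a rough illustration of cross-task KV sharing, the sketch below prefills a frame's observation once and lets an action task and a language task read the same cache entry, then compares that with a baseline in which each task prefills on its own. All names (`UnifiedKVCache`, `get_or_prefill`, `run_frame_isolated`, `run_frame_shared`) are hypothetical and are not taken from the OxyGen implementation; the "KV cache" here is a placeholder tuple rather than real tensors.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedKVCache:
    """Toy cache keyed by frame index; counts how often prefill actually runs."""
    entries: dict = field(default_factory=dict)
    prefill_calls: int = 0

    def prefill(self, frame_id, observation):
        # Stand-in for the expensive prefill pass over the shared observation.
        self.prefill_calls += 1
        return ("kv", frame_id, tuple(observation))

    def get_or_prefill(self, frame_id, observation):
        # Cross-task sharing: the first task to ask pays for prefill,
        # every later task for the same frame reuses the stored entry.
        if frame_id not in self.entries:
            self.entries[frame_id] = self.prefill(frame_id, observation)
        return self.entries[frame_id]


def run_frame_isolated(cache, frame_id, observation, tasks):
    # Baseline (assumed): each task prefills the same observation separately.
    return {task: cache.prefill(frame_id, observation) for task in tasks}


def run_frame_shared(cache, frame_id, observation, tasks):
    # Unified management (assumed): all tasks read one shared KV entry.
    return {task: cache.get_or_prefill(frame_id, observation) for task in tasks}


tasks = ["action", "language"]
obs = [0.1, 0.2, 0.3]

isolated = UnifiedKVCache()
run_frame_isolated(isolated, 0, obs, tasks)

shared = UnifiedKVCache()
kv_by_task = run_frame_shared(shared, 0, obs, tasks)

print(isolated.prefill_calls, shared.prefill_calls)
```

In this toy setting the isolated path runs prefill once per task, while the shared path runs it once per frame regardless of how many tasks consume the observation.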

Paper Structure (32 sections, 10 equations, 7 figures, 2 tables, 1 algorithm)

Figures (7)

  • Figure 1: Left: An example of deploying a Mixture-of-Transformers (MoT) Vision-Language-Action (VLA) model for parallel multi-task inference: based on per-frame input observations, the VLA generates robot actions within each frame while continuously generating language-based memories across multiple frames [torne2025mem]. Right: Comparison between two paradigms of MoT VLA inference: existing systems manage KV cache in isolation, slowing down inference due to redundant computation and resource contention; our method adopts unified KV cache management, achieving up to 3.7$\times$ speedup via cross-task KV sharing and cross-frame continuous batching.
  • Figure 2: KV-centric dataflow at frame $t$ with the unified KV cache manager. KV[t] represents the KV cache prefilled at frame $t$ (i.e., $\mathcal{K}_t$ as defined in Eq. \ref{eq:kv-cache}); $\Delta$Language[t] represents the incremental language tokens in $\mathbf{y}_t$, generated with $\mathcal{K}_t$.
  • Figure 3: Timeline comparison of OxyGen vs. isolated execution (baseline), with an example workload of $N=12$ total tokens per request. After the initial warmup, OxyGen steadily advances $B=3$ parallel requests to produce $k=4$ tokens per request per frame, significantly reducing the end-to-end inference latency per frame, and increasing both action frequency and language throughput, all by a factor of $1 + \Delta T / T$.
  • Figure 4: Comparison of action frequency and language throughput under different configurations. OxyGen consistently outperforms baselines by up to $3.7\times$, achieving up to 200 tokens/s language throughput and 70 Hz action frequency simultaneously.
  • Figure 5: Speedup ratio for action frequency of OxyGen vs. baseline under different configurations. Action denoising steps have modest impact on the speedup ratio.
  • ...and 2 more figures
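
The steady state described in the Figure 3 caption ($N=12$, $k=4$, $B=3$) can be reproduced with a small scheduling sketch. It assumes, purely for illustration, that one new language request is admitted per frame and that every in-flight request advances by exactly $k$ tokens in one batched decode step per frame; the function name `simulate` and its parameters are hypothetical, and no timing or model computation is modeled.

```python
from collections import deque


def simulate(num_frames, tokens_per_request=12, tokens_per_frame=4):
    """Return per-frame snapshots of (request_id, tokens_done) and the finish order."""
    in_flight = deque()  # each item: [request_id, tokens_generated]
    finished = []
    history = []
    for frame in range(num_frames):
        # Assumption: a new language request starts from this frame's observation.
        in_flight.append([frame, 0])
        # One batched decode step advances every in-flight request by k tokens.
        for req in in_flight:
            req[1] = min(req[1] + tokens_per_frame, tokens_per_request)
        history.append([tuple(req) for req in in_flight])
        # Requests are admitted in order and advance equally, so the oldest
        # request sits at the front and is the first to reach N tokens.
        while in_flight and in_flight[0][1] >= tokens_per_request:
            finished.append(in_flight.popleft()[0])
    return history, finished


history, finished = simulate(6)
batch_sizes = [len(snapshot) for snapshot in history]
print(batch_sizes)
print(finished)
```

Under these assumptions the batch grows during the first two frames and then holds at $N / k = 3$ requests, with one request completing per frame from the third frame onward.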