Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

Bowen Zhou, Zhou Xu, Wanli Li, Jingyu Xiao, Haoqian Wang

TL;DR

ST-Lite is a training-free KV cache compression framework for efficient GUI agents. It explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams through a dual-branch scoring policy that combines Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG).

Abstract

Large Vision-Language Models (VLMs) have emerged as powerful engines for autonomous GUI agents, yet their deployment is severely constrained by the substantial memory footprint and latency of the Key-Value (KV) cache during long-horizon interactions. While existing cache compression methods have proven effective for LLMs, we empirically demonstrate that they suffer from suboptimal performance in GUI scenarios due to a fundamental misalignment: unlike general visual tasks where attention sparsity varies across layers, GUI attention patterns exhibit uniform high-sparsity across all transformer layers. Motivated by this insight, we propose ST-Lite, a training-free KV cache compression framework tailored for efficient GUI agents that explicitly addresses the dynamic spatio-trajectory dependencies within GUI data streams. ST-Lite introduces a novel dual-branch scoring policy incorporating Component-centric Spatial Saliency (CSS) and Trajectory-aware Semantic Gating (TSG). Specifically, CSS preserves the structural integrity of interactive UI elements by evaluating local neighborhood saliency, while TSG mitigates historical redundancy by dynamically filtering visually repetitive KV pairs within the interaction trajectory. Extensive evaluations demonstrate that with only a 10-20% cache budget, ST-Lite achieves a 2.45x decoding acceleration while maintaining comparable or even superior performance compared to full-cache baselines, offering a scalable solution for resource-constrained GUI agents.
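
To make the dual-branch idea concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the scoring formulas, so the max-pooling window, the cosine-similarity threshold, the product used to combine the two branches, and every function name here (`neighborhood_saliency`, `trajectory_gate`, `dual_branch_scores`, `select_kv`) are assumptions chosen only for illustration.

```python
import math

def neighborhood_saliency(attn, height, width, radius=1):
    """CSS-like branch (assumed form): max-pool per-patch attention over a
    local window on the patch grid, so neighbors of a salient patch are
    scored together with it."""
    assert len(attn) == height * width
    out = []
    for r in range(height):
        for c in range(width):
            best = 0.0
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    best = max(best, attn[rr * width + cc])
            out.append(best)
    return out

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def trajectory_gate(prev_feats, curr_feats, threshold=0.95):
    """TSG-like branch (assumed form): gate is 0.0 for a historical patch that
    is nearly identical to the same patch in the next frame, else 1.0."""
    return [0.0 if cosine(p, q) >= threshold else 1.0
            for p, q in zip(prev_feats, curr_feats)]

def dual_branch_scores(saliency, gate):
    """Combine the two branches; a simple product is an assumption."""
    return [s * g for s, g in zip(saliency, gate)]

def select_kv(scores, budget_ratio):
    """Keep the indices of the top-scoring tokens within the cache budget."""
    k = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

In this sketch, a `budget_ratio` of 0.1 to 0.2 would correspond to the 10-20% cache budget mentioned in the abstract. A real system would apply the selection to the key and value tensors of each attention head rather than to Python lists.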

Paper Structure

This paper contains 30 sections, 12 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Conceptual illustration of ST-Lite compared with existing methods. The left side depicts the limitations of current strategies: window-based greedy methods (e.g., SnapKV) suffer from local optimum traps, while hierarchical allocation methods (e.g., PyramidKV) lead to structure misalignment. The right side demonstrates our ST-Lite, which integrates Component-centric Spatial Saliency (CSS) to preserve key interactive elements and Trajectory-aware Semantic Gating (TSG) to eliminate history redundancy, achieving efficient and accurate long-horizon reasoning.
  • Figure 2: Comparison of task execution flows under different context settings. (a) The original interaction sequence $\mathcal{O}$ containing full historical frames leads to successful task completion. (b) The pruned sequence $\mathcal{O}_r$, despite removing trajectory-wise redundant frames, also results in success. This visual evidence confirms that high historical redundancy exists in GUI tasks and that our aggressive compression preserves the essential semantic cues required for correct agent decision-making.
  • Figure 3: Layer-wise Attention Sparsity Analysis. Unlike the hierarchical sparsity variation observed in LLMs and general vision models, GUI agents exhibit a uniform high-sparsity pattern across all transformer layers on both (a) AITW and (b) AgentNetBench datasets. This justifies our uniform budget allocation strategy.
  • Figure 4: The overall architecture of ST-Lite. Our framework dynamically optimizes the KV cache through two synergistic modules: (1) Component-centric Spatial Saliency (CSS), which identifies and preserves spatially salient regions (e.g., functional buttons) within each frame using attention heatmap analysis; and (2) Trajectory-aware Semantic Gating (TSG), which filters out redundant historical states by measuring semantic shifts between consecutive frames. By integrating these spatial and historical policies, ST-Lite effectively reduces memory footprint while maintaining high precision in long-horizon GUI tasks.
  • Figure 5: Evaluation results on ScreenSpot Pro, AITW, and AgentNetBench with varied cache budgets. ST-Lite achieves accuracy comparable to Full Cache and outperforms multiple baselines under a limited KV cache budget. Interestingly, ST-Lite occasionally performs slightly better with a partial KV cache (e.g., on AITW). We attribute this to the regularization effect of KV cache compression.
  • ...and 3 more figures
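
The layer-wise observation behind Figure 3 can be illustrated with a small sketch. The paper's exact sparsity metric is not given in this excerpt, so the definition below (fraction of attention weights under a small threshold `eps`) and the helper names `attention_sparsity` and `uniform_budget` are assumptions for illustration only.

```python
def attention_sparsity(weights, eps=0.01):
    """Fraction of attention weights below eps (an assumed sparsity measure)."""
    return sum(1 for w in weights if w < eps) / len(weights)

def uniform_budget(total_tokens, num_layers):
    """Split a total KV budget evenly across layers, spreading any remainder
    over the first layers so the per-layer budgets sum to total_tokens."""
    base, extra = divmod(total_tokens, num_layers)
    return [base + (1 if i < extra else 0) for i in range(num_layers)]

# Toy per-layer attention rows (made-up numbers, not measurements).
layers = [
    [0.90, 0.05, 0.005, 0.001],
    [0.85, 0.10, 0.004, 0.002],
]
per_layer_sparsity = [attention_sparsity(row) for row in layers]
budgets = uniform_budget(10, 4)
```

If the measured sparsity is roughly the same in every layer, as the caption reports for GUI agents, an even split of this kind is a natural choice, whereas a hierarchical schedule would assume that sparsity changes with depth.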