Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao

TL;DR

GUIPruner is a training-free framework for high-resolution GUI navigation. It combines Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, with Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout.

Abstract

Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
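To make "decay-based resizing" concrete, here is a minimal sketch of how a per-frame resolution budget could shrink with temporal distance. It is not the paper's formula: the geometric decay `gamma ** distance`, the floor `min_budget`, the 28-pixel rounding `patch`, and the function name `tar_resolutions` are all assumptions chosen for illustration.

```python
import math


def tar_resolutions(width, height, num_history, gamma=0.5, min_budget=0.05, patch=28):
    """Return (width, height) per history frame, oldest frame first.

    Assumption: the pixel budget of a frame decays geometrically with its
    temporal distance from the current step and never drops below min_budget.
    """
    sizes = []
    for i in range(num_history):
        distance = num_history - i              # oldest frame has the largest distance
        budget = max(gamma ** distance, min_budget)
        side = math.sqrt(budget)                # scale each side so the area scales by budget
        # Round to a multiple of `patch` (hypothetical patch-grid alignment).
        w = max(patch, round(width * side / patch) * patch)
        h = max(patch, round(height * side / patch) * patch)
        sizes.append((w, h))
    return sizes


if __name__ == "__main__":
    for size in tar_resolutions(1120, 2240, num_history=3, gamma=0.25):
        print(size)
```

Under these assumptions, older screenshots are encoded at progressively smaller sizes, so the reduction happens before the vision encoder runs rather than after tokens are produced.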

Paper Structure (34 sections, 6 equations, 9 figures, 6 tables, 2 algorithms)

Figures (9)

  • Figure 1: Paradigm comparison of visual encoding and pruning. In contrast to conventional pipelines (Top) that incur high redundancy via uniform encoding and disrupt topology via unstructured pruning, GUIPruner (Bottom) implements efficient source-level reduction and structured, topology-preserving compression, ensuring high-precision grounding with minimal token consumption.
  • Figure 2: (a) Cross-attention weights exhibit a pronounced Temporal Decay, consistent with the "Recency Effect" in working memory. (b) Background tokens dominate the visual context across four datasets, indicating significant spatial redundancy. (c) Heatmaps reveal that specific background regions (red boxes) retain high attention as essential semantic anchors, cautioning against indiscriminate pruning.
  • Figure 3: Overview of the GUIPruner framework. The framework addresses spatiotemporal redundancy through two synergistic modules: (Left) Temporal-Adaptive Resolution (TAR) mimics biological fading memory by assigning decaying resolution budgets to historical frames based on temporal distance, eliminating redundancy in distant context. (Right) Stratified Structure-aware Pruning (SSP) operates on the current frame within shallow LLM layers. It preserves topological integrity by hierarchically retaining interactive foreground tokens ($S_{fg}$), semantic background anchors ($S_{bg}$), and a uniform structural grid ($S_{uni}$), effectively compressing visual tokens without inducing spatial hallucinations.
  • Figure 4: Decoupled sensitivity analysis of TAR and SSP on Qwen2-VL-2B. Left: Evaluation of TAR under varying history retention ratios ($\tau$) on AITW and Mind2Web. Right: Evaluation of SSP under varying current frame retention ratios ($\kappa$). GUIPruner consistently demonstrates superior robustness over baselines, particularly in high-compression regimes.
  • Figure 5: Hyperparameter sensitivity analysis on AITW. We evaluate the Step SR under varying compression intensities: the impact of the temporal decay factor $\gamma$ in TAR, and the impact of the background saliency factor $\rho$ in SSP.
  • ...and 4 more figures
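
The three-tier selection described in the Figure 3 caption ($S_{fg}$, $S_{bg}$, $S_{uni}$) can be sketched roughly as follows. This is an illustrative approximation, not the authors' algorithm: the function name `ssp_select`, the way the budget is split using $\kappa$ and $\rho$, and the raster-order stride used as a stand-in for the uniform structural grid are all assumptions.

```python
def ssp_select(scores, is_fg, kappa=0.4, rho=0.5):
    """Pick token indices to keep for the current frame.

    scores : per-token saliency values (e.g. attention), in raster order
    is_fg  : per-token booleans marking interactive foreground tokens
    kappa  : fraction of tokens to retain (assumed meaning)
    rho    : share of the non-foreground budget given to background anchors
             (assumed meaning)
    """
    n = len(scores)
    if len(is_fg) != n:
        raise ValueError("scores and is_fg must have the same length")
    budget = max(1, int(n * kappa))

    # 1) S_fg: foreground tokens, highest saliency first, capped by the budget.
    fg = sorted((i for i in range(n) if is_fg[i]), key=lambda i: -scores[i])
    keep = set(fg[:budget])

    # 2) S_bg: the most salient background tokens act as semantic anchors.
    remaining = budget - len(keep)
    bg = sorted((i for i in range(n) if not is_fg[i]), key=lambda i: -scores[i])
    keep.update(bg[:int(remaining * rho)])

    # 3) S_uni: fill the rest with evenly strided positions in raster order,
    #    a simple stand-in for a uniform structural grid.
    leftover = [i for i in range(n) if i not in keep]
    need = budget - len(keep)
    if need > 0:
        step = len(leftover) / need
        keep.update(leftover[int(k * step)] for k in range(need))

    # Return indices in raster order so the spatial ordering is preserved.
    return sorted(keep)


if __name__ == "__main__":
    toy_scores = [float(i) for i in range(16)]
    toy_fg = [i in (5, 6) for i in range(16)]
    print(ssp_select(toy_scores, toy_fg, kappa=0.5, rho=0.5))
```

Returning the kept indices in their original order reflects the caption's point that the layout, not just the most salient tokens, has to survive compression.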