An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
Georgios Pantazopoulos, Eda B. Özyiğit
TL;DR
This work tackles GUI grounding for agents, improving data efficiency through a model-based data filtering pipeline and parameter-efficient fine-tuning. Starting from 4.8M synthetic GUI examples, a base VLM filters for difficulty, alignment, and diversity, yielding 12K high-quality demonstrations (a reduction of roughly 400x). A 3B-parameter Vision-Language Model is trained with LoRA adapters under three regimes (SFT, SFT with CoT, and GRPO) and evaluated across desktop, web, and mobile benchmarks. Results show that data quality and efficient adaptation can rival larger-scale training, making compact multimodal GUI reasoning agents practical for real-world tasks.
Abstract
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated in three steps: identifying challenging cases, removing misaligned examples, and selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
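As a rough illustration of the reinforcement learning regime named above, the sketch below computes the group-relative advantages that Group Relative Policy Optimization uses in place of a learned value baseline: several responses are sampled for the same query, and each reward is normalised by the mean and standard deviation of its own group. The reward function here (`point_in_box_reward`, a binary check that a predicted click lands inside the target box) is a hypothetical stand-in for a grounding reward and is not taken from the paper; the function names and the `eps` constant are likewise assumptions.

```python
from statistics import mean, pstdev


def point_in_box_reward(point, box):
    """Hypothetical grounding reward (an assumption, not the paper's):
    1.0 if the predicted click (x, y) lies inside the box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against the mean and standard deviation
    of the group of responses sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled click predictions for one query with a single target box.
target_box = (10, 10, 50, 30)
predictions = [(20, 20), (60, 20), (12, 29), (0, 0)]
rewards = [point_in_box_reward(p, target_box) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and that query contributes no learning signal, which suggests one reason a curation step that keeps challenging examples can matter for this style of training.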
