An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Georgios Pantazopoulos, Eda B. Özyiğit

TL;DR

This work tackles GUI grounding for agents by improving data efficiency through a model-based data filtering pipeline and parameter-efficient fine-tuning. Starting from 4.8M synthetic GUI examples, a base VLM filters for difficulty, alignment, and diversity, yielding 12K high-quality demonstrations (a reduction of roughly 400x). A 3B-parameter Vision-Language Model is trained under SFT, SFT with CoT, and GRPO with LoRA adapters, and evaluated across desktop, web, and mobile benchmarks. Results show that data quality and efficient adaptation can rival larger-scale training, making compact multimodal GUI reasoning agents practical for real-world tasks.

Abstract

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned ones, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
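
The three filtering stages named in the abstract (difficulty, alignment, diversity) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Example` fields, the 0.5 alignment threshold, and the toy 2-D embeddings are hypothetical, and greedy farthest-point selection stands in for the embedding clustering that the paper describes.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Sequence


@dataclass
class Example:
    instruction: str
    base_hit: bool              # hypothetical: base VLM grounded this example correctly
    align_score: float          # hypothetical: ranker's instruction-region alignment score
    embedding: Sequence[float]  # hypothetical: embedding taken from the VLM


def keep_challenging(pool: List[Example]) -> List[Example]:
    """Stage 1: keep examples the base model gets wrong."""
    return [ex for ex in pool if not ex.base_hit]


def keep_aligned(pool: List[Example], threshold: float = 0.5) -> List[Example]:
    """Stage 2: drop examples whose instruction does not match the target region."""
    return [ex for ex in pool if ex.align_score >= threshold]


def select_diverse(pool: List[Example], k: int) -> List[Example]:
    """Stage 3: greedy farthest-point selection (a stand-in for clustering)."""
    if not pool or k <= 0:
        return []
    chosen = [0]  # start from the first example, then add the farthest remaining one
    while len(chosen) < min(k, len(pool)):
        candidates = [i for i in range(len(pool)) if i not in chosen]
        nxt = max(
            candidates,
            key=lambda i: min(dist(pool[i].embedding, pool[j].embedding) for j in chosen),
        )
        chosen.append(nxt)
    return [pool[i] for i in chosen]


def curate(pool: List[Example], k: int, threshold: float = 0.5) -> List[Example]:
    """Apply the three stages in the order given in the abstract."""
    return select_diverse(keep_aligned(keep_challenging(pool), threshold), k)


pool = [
    Example("click Save", True, 0.9, (0.0, 0.0)),         # easy: dropped in stage 1
    Example("open Settings", False, 0.8, (0.0, 0.0)),
    Example("open Preferences", False, 0.9, (0.1, 0.0)),  # near-duplicate of the above
    Example("tap Share", False, 0.2, (5.0, 5.0)),         # misaligned: dropped in stage 2
    Example("close tab", False, 0.7, (4.0, 0.0)),
]
curated = curate(pool, k=2)
print([ex.instruction for ex in curated])  # ['open Settings', 'close tab']

# Reduction ratio reported in the TL;DR: 4.8M examples down to 12K.
reduction = 4_800_000 / 12_000
```

In this toy pool, the near-duplicate "open Preferences" is skipped in favour of the more distant "close tab", which is the behaviour the diversity stage is meant to provide.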

Paper Structure

This paper contains 43 sections, 9 figures, 12 tables.

Figures (9)

  • Figure 1: (top): Overview of our filtering approach. We begin with a pool of noisy GUI examples drawn from desktop, web, and mobile interfaces. A base VLM scores candidates, allowing us to partition the pool into easy and challenging cases. We train a ranking model on the easy subset to decide whether an instruction aligns with a candidate region in the interface. The ranker is then applied to the challenging subset to retain only aligned instances. We then cluster the embeddings from a single forward pass of the VLM and select a diverse set of challenging examples. (bottom left): We integrate our grounding model into the SeeAct-V framework [gou2025navigating], which uses screenshots as the only environmental observations and performs pixel-level operations via a planner VLM. (bottom right): Grounding performance on ScreenSpot [cheng2024seeclick]. Our model significantly outperforms prior approaches that rely on supervised fine-tuning over massive amounts of data and is on par with concurrent approaches combining RL and data filtering.
  • Figure 2: Performance of our ranking model against 3B/7B variants on ScreenSpot, ScreenSpot-v2, and OSWorld-G.
  • Figure 3:
  • Figure 4:
  • Figure 6: Reward curves for four different LoRA configurations.
  • ...and 4 more figures