InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xavier Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu

TL;DR

The paper targets robust GUI grounding for autonomous agents operating on visual GUIs, identifying semantic alignment as the key challenge alongside spatial precision. It introduces Adaptive Exploration Policy Optimization (AEPO), which combines multi-answer generation with an Adaptive Exploration Reward to overcome exploration bottlenecks in RLVR. The InfiGUI-G1 models (3B and 7B) achieve state-of-the-art results across five GUI grounding benchmarks with notable data efficiency, trained on ~44k samples. Analyses show AEPO adapts exploration to task difficulty and delivers learning signals for hard-to-explore samples, offering a practical advance for semantically accurate GUI agents.

Abstract

The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from first principles of efficiency $\eta = U/C$. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.

Paper Structure

This paper contains 24 sections, 4 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Primary GUI-grounding failure modes. (a) Spatial-alignment failure: the model selects the correct icon but localizes it imprecisely. (b) Semantic-alignment failure: the model localizes precisely on an incorrect icon due to misinterpreting the instruction. Although RLVR methods have advanced spatial alignment, semantic alignment remains the critical bottleneck for complex GUI tasks, and this work is devoted to addressing it.
  • Figure 2: Visualization of the AER function based on the efficiency ratio $\eta = U/C$. (a) The reward rises nonlinearly as the rank $k$ of the correct answer decreases, strongly incentivizing the model to place the correct answer first. (b) The AER dynamically balances exploration and exploitation: successful trials (green/blue curves) receive higher reward for greater efficiency (smaller candidate set $N$), whereas failures (red curve) incur diminishing penalties to promote broader exploration.
  • Figure 3: Comparison of AEPO and a naive RL baseline. Top: The naive single-answer approach becomes trapped on high-confidence errors, repeatedly sampling the same incorrect action and producing a vanishing learning signal when no positive reward is discovered. Bottom: AEPO employs multi-answer generation to explore diverse candidates in each rollout and an AER to derive an informative learning signal from their efficiency and correctness. These mechanisms break the exploration bottleneck in GUI agents and enable robust semantic alignment.
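
The behavior described in the captions for Figures 2 and 3 can be illustrated with a small Python sketch. It is not the paper's implementation. The text above only states that the reward follows an efficiency ratio $\eta = U/C$, so the specific choices here are assumptions made for illustration: utility $U = \pm 1$, cost $C = \sqrt{N k}$ on success, and cost $C = N$ on failure. The helper names `aer_reward` and `group_advantages` are hypothetical, and the group-normalized advantage is an assumed training setup that the text does not specify.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional


def aer_reward(num_candidates: int, correct_rank: Optional[int]) -> float:
    """Efficiency-style reward eta = U / C (illustrative, assumed form).

    num_candidates: N, how many answers the model proposed in one rollout.
    correct_rank:   k, 1-based rank of the first correct answer, or None
                    if no proposed answer was correct.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be >= 1")
    if correct_rank is None:
        # Failure: U = -1 with assumed cost C = N, so the penalty
        # shrinks as more candidates are explored.
        return -1.0 / num_candidates
    if not 1 <= correct_rank <= num_candidates:
        raise ValueError("correct_rank must lie in [1, num_candidates]")
    # Success: U = +1 with assumed cost C = sqrt(N * k), so fewer
    # candidates and an earlier correct answer both raise the reward.
    return 1.0 / math.sqrt(num_candidates * correct_rank)


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages (assumed setup, not from the text)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # Identical rewards across the group carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


if __name__ == "__main__":
    # Single-answer rollouts that all miss: every advantage is zero.
    print(group_advantages([aer_reward(1, None)] * 4))
    # Multi-answer rollouts: one success among failures yields a signal.
    mixed = [aer_reward(4, 2), aer_reward(4, None), aer_reward(2, None)]
    print(group_advantages(mixed))
```

Under these assumptions, a group of identical failed single-answer rollouts produces all-zero advantages, which mirrors the vanishing signal in the top panel of Figure 3. Once multi-answer rollouts differ in $N$ or $k$, the rewards differ and the advantages become informative.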