MUG: Interactive Multimodal Grounding on User Interfaces

Tao Li, Gang Li, Jingjie Zheng, Purple Wang, Yang Li

TL;DR

MUG introduces an interactive multimodal grounding task for mobile UIs in which a user and an agent collaborate iteratively on a single screen. A large dataset (77,820 sequences across 7,132 apps) supports both offline and online evaluation, the latter with human and automatic user models. The authors implement a Transformer-based UI encoder and a causal grounding decoder, and explore multiple agent and user model variants, including imitation and offline RL approaches. Results show that allowing multi-turn interaction improves absolute task completion by 18% on the full test set and by 31% on the challenging subset, and they reveal robustness challenges that motivate future work on grounding, user modeling, and evaluation. The work provides a benchmark and demonstrates the value of interactive grounding for realistic UI understanding and accessibility scenarios.

Abstract

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interaction such that, upon seeing the agent's responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performance in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces, of which 20% involve multiple rounds of interaction. To establish our benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation; the online strategy consists of both human evaluation and automatic evaluation with simulators. Our experiments show that allowing iterative interaction significantly improves the absolute task completion by 18% over the entire test dataset and 31% over the challenging subset. Our results lay the foundation for further investigation of the problem.
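
The multi-round protocol described in the abstract can be pictured as a simple loop: the user issues a command, the agent selects an object on the screen, and the user follows up if the selection is wrong. The sketch below is only an illustration of that loop under toy assumptions; `SCREEN`, `toy_user`, `toy_agent`, and `run_episode` are hypothetical names and rules invented here, not the paper's models, data format, or evaluation code.

```python
from typing import Callable, List, Tuple

# One (user command, index chosen by the agent) pair per turn.
History = List[Tuple[str, int]]

# Hypothetical labels of the UI objects on a single screen (illustrative only).
SCREEN = ["Play", "Pause", "Play"]


def toy_user(target: int, history: History) -> str:
    """Rule-based stand-in for a user: name the label, then ask for a correction."""
    label = SCREEN[target]
    if not history:
        return f"tap {label}"  # first command is ambiguous when labels repeat
    return f"not that one, the other {label}"


def toy_agent(command: str, history: History) -> int:
    """Rule-based stand-in for an agent: match labels, skipping earlier choices."""
    tried = {choice for _, choice in history}
    matches = [i for i, label in enumerate(SCREEN) if label.lower() in command.lower()]
    untried = [i for i in matches if i not in tried]
    if untried:
        return untried[0]
    return matches[0] if matches else 0


def run_episode(
    target: int,
    user: Callable[[int, History], str],
    agent: Callable[[str, History], int],
    max_turns: int,
) -> Tuple[bool, History]:
    """Alternate user commands and agent selections until the target is chosen."""
    history: History = []
    for _ in range(max_turns):
        command = user(target, history)
        choice = agent(command, history)
        history.append((command, choice))
        if choice == target:
            return True, history
    return False, history


single_turn_done, _ = run_episode(2, toy_user, toy_agent, max_turns=1)
multi_turn_done, multi_turn_history = run_episode(2, toy_user, toy_agent, max_turns=3)
```

In this toy setting the first command matches two objects, so a single round fails while a second, corrective round reaches the target. It shows the mechanism of iterative refinement only and says nothing about the size of the gains reported in the paper.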

Paper Structure

This paper contains 38 sections, 14 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Two illustrative examples of the MUG task. There are two turns in each of these examples. Interactions happen within a single screen. User commands are shown above the screens. The target object and the agent's choices are each marked with a colored box.
  • Figure 2: MUG annotation interfaces consist of a user view and an agent view.
  • Figure 3: MUG examples 1-4. Instructions are shown at the top of each turn. The agent selection and the target are each marked with a colored box.
  • Figure 4: MUG examples 5-8. Instructions are shown at the top of each turn. The agent selection and the target are each marked with a colored box.
  • Figure 5: Completed examples by the Imitation agent following the instructions generated by the Heuristic user.
  • ...and 3 more figures