UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

TL;DR

UITron-Speech introduces the first end-to-end GUI agent that directly processes speech instructions and on-device screenshots to predict user actions. It addresses data scarcity with random-speaker TTS-generated speech data and mitigates modality imbalance via mixed-modality grounding training, plus a training-free two-step grounding refinement to reduce localization errors. The framework is validated across multiple benchmarks (ScreenSpot, AndroidControl, GUI-Odyssey), showing competitive grounding accuracy and superior offline success rates, demonstrating the viability of speech-driven GUI agents for accessible and intelligent human-computer interaction. The work lays a foundation for hands-free GUI automation and provides datasets and code to advance research in speech-guided interaction with visual interfaces.

Abstract

Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: UITron-Speech model architecture.
  • Figure 2: Grounding training stage prompt template.
  • Figure 3: GUI agent speech instruction dataset construction pipeline.
  • Figure 4: Two-step grounding refinement.
  • Figure 5: Comparison of text-based and speech-based GUI agent evaluation results across instruction lengths (in characters).