UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
TL;DR
UITron-Speech is the first end-to-end GUI agent that directly processes speech instructions and on-device screenshots to predict user actions. It addresses data scarcity by synthesizing speech instructions with a random-speaker TTS model, mitigates modality imbalance through mixed-modality grounding training, and adds a training-free two-step grounding refinement to reduce localization errors. The framework is evaluated on multiple benchmarks (ScreenSpot, AndroidControl, GUI-Odyssey), where it shows competitive grounding accuracy and superior offline success rates, supporting the viability of speech-driven GUI agents for accessible and intelligent human-computer interaction. The work lays a foundation for hands-free GUI automation and releases datasets and code to advance research in speech-guided interaction with visual interfaces.
Abstract
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
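The abstract does not spell out how the two-step grounding refinement works, so the following is only a minimal sketch of one plausible reading: a coarse prediction on the full screenshot, followed by a second prediction on a crop centred on that coarse point, with the result mapped back to full-screen coordinates. The crop-and-re-predict scheme, the function names, and the `predict` callable (a stand-in for the agent's grounding call) are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_box_around(point: Point,
                    image_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Return a crop window centred on `point`, clamped to stay inside the image."""
    width, height = image_size
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    left = int(round(point[0] - crop_w / 2))
    top = int(round(point[1] - crop_h / 2))
    # Shift the window back inside the image if it overflows an edge.
    left = max(0, min(left, width - crop_w))
    top = max(0, min(top, height - crop_h))
    return (left, top, left + crop_w, top + crop_h)


def refine_grounding(predict: Callable[[Optional[Box]], Point],
                     image_size: Tuple[int, int],
                     crop_size: Tuple[int, int]) -> Point:
    """Two-pass grounding (assumed scheme, see lead-in).

    `predict(None)` is a hypothetical call that grounds the instruction on the
    full screenshot; `predict(box)` grounds it on the cropped region and
    returns a point in that crop's local coordinates.
    """
    coarse = predict(None)                       # step 1: full screenshot
    box = crop_box_around(coarse, image_size, crop_size)
    local = predict(box)                         # step 2: zoomed-in crop
    # Translate the crop-local point back to full-screenshot coordinates.
    return (box[0] + local[0], box[1] + local[1])
```

No retraining is involved in this sketch: the same model is simply queried twice, which is consistent with the "training-free" description. Whether the actual method crops, rescales, or uses a different second step is not stated in the abstract.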
