GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning

Minh Duc Vu; Han Wang; Zhuang Li; Jieshan Chen; Shengdong Zhao; Zhenchang Xing; Chunyang Chen

TL;DR

GPTVoiceTasker tackles the inefficiency and misinterpretation barriers of mobile voice assistants by combining LLM-driven command understanding with a dynamic, history-informed on-device execution framework. It pairs unprecedented task exploration (collecting UI context, anonymising data, and prompting the LLM in a two-step process) with precedented task automation via a transition graph and semantic screen descriptions, so that both novel and recurring tasks can be completed by voice. Key contributions include hierarchical collection of UI knowledge, privacy-preserving prompt design with Few-shot and Chain-of-Thought prompts, a shortest-path navigation engine, and a human-in-the-loop feedback mechanism for continual refinement, all implemented on Android with GPT-4 and open-sourced. Empirical results show strong command parsing ($EM$ ≈ 84–85%) and high multi-step task success (≈85.7%), and a user study reports a 34.85% gain in task efficiency together with favorable usability feedback, highlighting practical impact for accessibility and everyday task automation on mobile devices.

Abstract

Virtual assistants have the potential to play an important role in helping users achieve different tasks. However, these systems face challenges in their real-world usability, characterized by inefficiency and difficulty in grasping user intentions. Leveraging recent advances in Large Language Models (LLMs), we introduce GptVoiceTasker, a virtual assistant poised to enhance user experiences and task efficiency on mobile devices. GptVoiceTasker excels at intelligently deciphering user commands and executing relevant device interactions to streamline task completion. The system continually learns from historical user commands to automate subsequent usages, further enhancing execution efficiency. Our experiments affirm GptVoiceTasker's exceptional command interpretation abilities and the precision of its task automation module. In our user study, GptVoiceTasker boosted task efficiency in real-world scenarios by 34.85%, accompanied by positive participant feedback. We made GptVoiceTasker open-source, inviting further research into utilizing LLMs for diverse tasks through prompt engineering and into leveraging user usage data to improve efficiency.

Paper Structure (36 sections, 1 equation, 6 figures, 4 tables)

Introduction
Background & Related works
Voice Control & Automation on Mobile Devices
Large Language Models for Enhanced Human-AI Collaboration
The GptVoiceTasker System
Unprecedented Task Exploration
Data Collection Module
Private Information Anonymisation.
Prompt Creation
Action Executor
Precedented Task Automation
Transition Graph
Screen Description & Command Pattern Matching
Path Finding & Execution.
Human Feedback Loop
...and 21 more sections

Figures (6)

Figure 1: An example use case in the Home Workout application, where the user needs to interact with the smartphone hands-free because they are physically occupied. When performing an unprecedented task (Section \ref{sec:onScreenInteraction}), GptVoiceTasker repeatedly predicts on-screen actions from the current UI information and executes the response to achieve the user's task. The interactions collected during this process are then saved to streamline the execution of subsequent similar tasks (Section \ref{sec:personalise}).
Figure 2: An example of our prompt and response format to determine the most relevant target to press.
Figure 3: An example use case in Uber Eats showing how GptVoiceTasker uses historical tasks to execute a new user command. The system first locates the current screen and the destination screen in the collected graph. It then identifies and executes the action sequence needed to traverse to the destination screen. Finally, feedback from users is used to improve subsequent executions.
Figure 4: The average time, in seconds, taken to complete each task using GptVoiceTasker and the baselines.
Figure 5: Comparison between GptVoiceTasker, Voicify, and Voice Access: A) the average cognitive load measured with the NASA-TLX form (lower is better; *: p < 0.01, **: p < 0.001), and B) Task 2 from the user evaluation with GptVoiceTasker and the baselines.
...and 1 more figure
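
The path-finding step described in the Figure 3 caption (locate the current and destination screens in the transition graph, then recover the action sequence between them) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an unweighted graph and uses breadth-first search as one simple way to obtain a shortest path. The function name `shortest_action_path`, the graph layout, and the screen and action labels are all hypothetical.

```python
from collections import deque


def shortest_action_path(graph, start, goal):
    """Return the shortest list of actions leading from start to goal.

    graph maps a screen id to a list of (action, next_screen) pairs.
    Returns [] if start == goal and None if goal is unreachable.
    (Hypothetical data layout, chosen for illustration only.)
    """
    if start == goal:
        return []
    queue = deque([start])
    # For each discovered screen, remember how we got there.
    came_from = {start: None}
    while queue:
        screen = queue.popleft()
        for action, next_screen in graph.get(screen, []):
            if next_screen in came_from:
                continue  # already reached by a path that is no longer
            came_from[next_screen] = (screen, action)
            if next_screen == goal:
                # Walk back to the start, collecting actions in reverse.
                path = []
                node = goal
                while came_from[node] is not None:
                    previous, taken = came_from[node]
                    path.append(taken)
                    node = previous
                path.reverse()
                return path
            queue.append(next_screen)
    return None


# Hypothetical transition graph for a food-delivery app.
example_graph = {
    "home": [("tap 'Search'", "search"), ("tap 'Account'", "account")],
    "search": [("tap 'Pizza'", "restaurant_list")],
    "account": [("tap 'Orders'", "order_history")],
    "restaurant_list": [("tap first result", "restaurant_menu")],
}

print(shortest_action_path(example_graph, "home", "restaurant_menu"))
# ["tap 'Search'", "tap 'Pizza'", "tap first result"]
```

Breadth-first search suffices here because every transition is treated as having equal cost; if transitions carried different costs (for example, expected execution time), a weighted search such as Dijkstra's algorithm would be the natural substitute.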
