GPTVoiceTasker: Advancing Multi-step Mobile Task Efficiency Through Dynamic Interface Exploration and Learning

Minh Duc Vu, Han Wang, Zhuang Li, Jieshan Chen, Shengdong Zhao, Zhenchang Xing, Chunyang Chen

TL;DR

GPTVoiceTasker tackles the inefficiency and misinterpretation problems of mobile voice assistants by combining LLM-driven command understanding with a dynamic, history-informed on-device execution framework. It pairs exploration of unprecedented tasks (collecting UI context, anonymising data, and prompting the LLM in a two-step process) with automation of precedented tasks via a transition graph and semantic screen descriptions, so that both novel and recurring tasks can be completed by voice. Key contributions include hierarchical UI knowledge collection, privacy-preserving prompt design with Few-shot and Chain-of-Thought prompts, a shortest-path navigation engine, and a human-in-the-loop feedback mechanism for continual refinement, all implemented on Android with GPT-4 and open-sourced. Empirical results show strong command parsing ($EM$ ≈ 84–85%), high multi-step task success (≈85.7%), and a user study reporting a 34.85% gain in task efficiency with favorable usability ratings, indicating practical value for accessibility and everyday task automation on mobile devices.

Abstract

Virtual assistants have the potential to play an important role in helping users achieve different tasks. However, these systems face challenges in their real-world usability, characterized by inefficiency and struggles in grasping user intentions. Leveraging recent advances in Large Language Models (LLMs), we introduce GptVoiceTasker, a virtual assistant poised to enhance user experiences and task efficiency on mobile devices. GptVoiceTasker excels at intelligently deciphering user commands and executing relevant device interactions to streamline task completion. The system continually learns from historical user commands to automate subsequent usages, further enhancing execution efficiency. Our experiments affirm GptVoiceTasker's exceptional command interpretation abilities and the precision of its task automation module. In our user study, GptVoiceTasker boosted task efficiency in real-world scenarios by 34.85%, accompanied by positive participant feedback. We made GptVoiceTasker open-source, inviting further research into LLM utilization for diverse tasks through prompt engineering and into leveraging user usage data to improve efficiency.

Paper Structure (36 sections, 1 equation, 6 figures, 4 tables)


Figures (6)

  • Figure 1: An example use case in the Home Workout application, where the user needs to interact with the smartphone hands-free because they are physically occupied. When performing an unprecedented task, GptVoiceTasker repeatedly predicts on-screen actions from the current UI information and executes the response to achieve the user's task. The interactions collected during this process are then saved to streamline the execution of subsequent similar tasks.
  • Figure 2: An example of our prompt and response format to determine the most relevant target to press.
  • Figure 3: An example use case in Uber Eats showing how GptVoiceTasker uses historical tasks to execute a new user command. The system first locates the current screen and the destination screen in the collected graph. It then identifies and executes the action sequence needed to traverse to the destination screen. Finally, it uses feedback from users to improve subsequent executions.
  • Figure 4: The average time taken to complete each task using GptVoiceTasker and the baselines in seconds.
  • Figure 5: Comparison between GptVoiceTasker, Voicify, and Voice Access: A) the average cognitive load measured with the NASA-TLX form (lower is better; *: p < 0.01, **: p < 0.001) and B) Task 2 from the user evaluation with GptVoiceTasker and the baselines.
  • ...and 1 more figure
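
The Figure 3 caption describes locating the current and destination screens in a collected transition graph and then executing the action sequence between them. The following is a minimal sketch of that idea, not the authors' implementation: it assumes an unweighted graph and uses breadth-first search as one possible shortest-path method. The screen names, action strings, `GRAPH`, and `shortest_action_sequence` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical transition graph: screen -> list of (action, next_screen).
# In a real system these edges would come from previously recorded interactions.
GRAPH = {
    "home": [("tap 'Search'", "search"), ("tap 'Account'", "account")],
    "search": [("tap 'Pizza'", "restaurant_list")],
    "restaurant_list": [("tap first result", "restaurant")],
    "account": [("tap 'Orders'", "orders")],
    "restaurant": [],
    "orders": [],
}


def shortest_action_sequence(graph, start, goal):
    """Return the shortest list of actions from start to goal, or None if unreachable."""
    if start == goal:
        return []
    queue = deque([start])
    # Maps each visited screen to (previous screen, action taken to reach it).
    came_from = {start: None}
    while queue:
        screen = queue.popleft()
        for action, next_screen in graph.get(screen, []):
            if next_screen in came_from:
                continue  # already reached by a path that is at least as short
            came_from[next_screen] = (screen, action)
            if next_screen == goal:
                # Walk back from the goal to the start, collecting actions.
                actions = []
                node = goal
                while came_from[node] is not None:
                    previous, taken = came_from[node]
                    actions.append(taken)
                    node = previous
                return actions[::-1]
            queue.append(next_screen)
    return None


if __name__ == "__main__":
    for step in shortest_action_sequence(GRAPH, "home", "restaurant"):
        print(step)
```

Breadth-first search suffices here because every recorded transition is treated as one step; if transitions carried different costs (for example, measured execution time), a weighted method such as Dijkstra's algorithm would be the natural substitute.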