FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Qinglong Yang, Haoming Li, Haotian Zhao, Xiaokai Yan, Jingtao Ding, Fengli Xu, Yong Li

Abstract

Mobile GUI agents are becoming critical tools for improving user experience on smart devices, with multimodal large language models (MLLMs) emerging as the dominant paradigm in this domain. Current agents, however, rely on explicit human instructions and overlook the potential of contextual information (such as location, time, and user profile) and historical data for proactive task suggestion. Moreover, previous work focuses on optimizing the success rate of task execution but pays less attention to personalized execution trajectories, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip 20K benchmark. We collected 20K unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated; they were acquired continuously from users' long-term, real-life usage and include essential user-related contextual information. The benchmark contains two new tracks: proactive task suggestion, which analyzes environment observations and users' previous intents, and personalized task execution, which caters to users' action preferences. Our experiments reveal that the proposed tracks pose significant challenges for leveraging user-related information in GUI tasks. A human study further shows a large gap between existing agents and humans. A model fine-tuned on our collected data effectively utilized user information and achieved good results, highlighting the potential of our approach for building more user-oriented mobile LLM agents. Our code is open-source at https://github.com/tsinghua-fib-lab/FingerTip-20K for reproducibility.

Paper Structure

This paper contains 44 sections, 2 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: An overview task example in FingerTip 20K. The agent proactively offers task suggestions to the user and personalizes the execution of tasks in a way that aligns with the user's preferences.
  • Figure 2: Demonstration of proactive task suggestion and personalized task execution.
  • Figure 2: The action space of an agent when interacting with a mobile phone environment.
  • Figure 3: Data collection pipeline. Users record their intents and demonstrate actions by using the FingerTip APP in their daily mobile phone usage.
  • Figure 4: Dataset statistics and distribution. (a) The length distribution of the natural language intents recorded by users. (b) The distribution of the number of screenshots contained in each episode (i.e., the distribution of the number of action steps of users). (c) The distribution of all categories to which the intents belong. (d) The distribution of all apps involved in the data.
  • ...and 3 more figures