KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Abstract

Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench evaluates the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks lie not in GUI navigation but in preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
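
To make the hybrid scoring protocol concrete, the minimal sketch below shows one plausible way to combine deterministic rule checks with an LLM-as-a-Judge score into a single task score. All names and the 50/50 weighting (`Trajectory`, `rule_checks`, `llm_judge_fn`, `hybrid_score`) are illustrative assumptions for exposition, not KnowU-Bench's actual interfaces or weights.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """Minimal stand-in for one agent rollout (hypothetical schema)."""
    final_ui_state: dict   # e.g. parsed Android UI dump after the run
    dialogue: List[str]    # agent <-> simulated-user turns
    actions: List[str]     # grounded GUI actions taken by the agent


def hybrid_score(
    traj: Trajectory,
    rule_checks: List[Callable[[Trajectory], bool]],
    llm_judge_fn: Callable[[Trajectory], float],
    w_rule: float = 0.5,
    w_judge: float = 0.5,
) -> float:
    """Combine rule-based verification with an LLM-as-a-Judge score.

    Rule checks verify objective outcomes (e.g. a target setting was
    toggled); the LLM judge rates criteria that are hard to encode as
    rules, such as whether clarification questions respected the user's
    preferences. The equal weighting here is an assumption.
    """
    if not rule_checks:
        rule_score = 1.0
    else:
        rule_score = sum(check(traj) for check in rule_checks) / len(rule_checks)
    judge_score = llm_judge_fn(traj)  # expected to return a value in [0, 1]
    return w_rule * rule_score + w_judge * judge_score


# Usage: a task that requires dark mode to be enabled with at most one
# clarification query before acting.
checks = [
    lambda t: t.final_ui_state.get("dark_mode") is True,
    lambda t: sum("?" in turn for turn in t.dialogue) <= 1,
]
example = Trajectory(
    final_ui_state={"dark_mode": True},
    dialogue=["Agent: Do you prefer the dark theme?", "User: Yes."],
    actions=["open_settings", "tap('Display')", "toggle('Dark theme')"],
)
print(hybrid_score(example, checks, llm_judge_fn=lambda t: 0.8))  # -> 0.9
```

Keeping the objective outcome checks separate from the judge call also makes it straightforward to ablate the two signals, in the spirit of the hybrid-versus-rule-based comparison described for Figure 4 below.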

Paper Structure

This paper contains 56 sections, 5 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Left: Model performance drops substantially from clear to vague instructions. Right: Key components of KnowU-Bench.
  • Figure 2: Overview of the KnowU-Bench framework. The benchmark couples a reproducible environment module, a GUI agent, an online user simulator grounded in user profiles and logs, and a hybrid evaluation pipeline combining rule-based checks with LLM-as-a-judge scoring.
  • Figure 3: Visualization analyses. (a) Average score across four user roles: Developer (Dev.), Grandma (Grand.), Student (Stud.), and Researcher (Res.). (b) Personalized interaction metrics, including Efficiency (defined as $50/\text{Avg.\ Steps}$), Average Queries, and Interaction Efficiency (IE). (c) Proactive safety rates, including Act, Silent, and Stop.
  • Figure 4: Judge sensitivity against human ratings. Task-level scatter plots comparing two automatic evaluators against the mean score of four human experts on 26 shared trajectories. Each point denotes one task, the dashed diagonal indicates perfect agreement, and the inset reports mean absolute error. The hybrid judge (LLM-as-a-judge combined with rule-based scoring) exhibits tighter clustering around the diagonal and lower error than the pure rule-based variant, confirming stronger alignment with human judgment.
  • Figure 5: Failure mode breakdown. (a) Personalized failures are categorized into Clarify (insufficient clarification), Partial (partial preference satisfaction), Preference (preference misidentification), and GUI (GUI navigation failure). Most failures come from Clarify and Partial. (b) Proactive failures are categorized into Intervention (unwarranted intervention), Passive (false passivity), GUI (GUI navigation failure), and Rejection (post-rejection violation).
  • ...and 12 more figures