Proactive Agents, Long-term User Context, VLM Annotation, Privacy Protection, Human-Computer Interaction
Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, Yang Li
TL;DR
ProAgentBench offers a rigorous, privacy-preserving benchmark for proactive AI agents operating in real-world workflows. By collecting 28,528 events over 500+ hours with long-term user context, the framework decomposes proactive assistance into When to Intervene and How to Intervene, and evaluates both LLMs and VLMs across memory-augmented and prompt-based baselines. Results show that long-term memory and authentic real-world data substantially improve both timing and content generation, with Knowledge Graph memory delivering the largest gains and real-world fine-tuning outperforming synthetic data. The work provides a practical foundation for context-aware, proactive agents that can integrate into human workflows while respecting privacy and ethical considerations.
Abstract
Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) a focus on isolated tasks rather than continuous workflows, which misses the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assistance content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluate LLM- and VLM-based baselines. Empirically, we show that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.
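The abstract reports a burstiness value without stating how it is computed. A minimal sketch follows, assuming the commonly used burstiness parameter B = (sigma - mu) / (sigma + mu), where mu and sigma are the mean and standard deviation of the gaps between consecutive events; this definition, the function name `burstiness`, and the toy timestamps are assumptions for illustration and are not taken from the ProAgentBench code.

```python
import statistics


def burstiness(timestamps):
    """Return B = (sigma - mu) / (sigma + mu) over inter-event gaps.

    Assumed definition (not confirmed by the source): B is near -1 for
    perfectly regular events, near 0 for Poisson-like events, and
    approaches 1 for highly bursty events.
    """
    ordered = sorted(timestamps)
    # Gaps between consecutive events, in the same unit as the timestamps.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events to estimate burstiness")
    mu = statistics.fmean(gaps)
    sigma = statistics.pstdev(gaps)  # population standard deviation
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


# Hypothetical timestamps in seconds, for illustration only.
regular = [0, 1, 2, 3, 4]      # evenly spaced events
bursty = [0, 1, 2, 3, 100]     # a short burst followed by a long pause
print(burstiness(regular))
print(burstiness(bursty))
```

Under this definition, evenly spaced events give the minimum value of -1, while clusters of activity separated by long pauses push B toward 1, which is the regime the reported B=0.787 would fall in.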
