Proactive Agents, Long-term User Context, VLM Annotation, Privacy Protection, Human-Computer Interaction

Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, Yang Li

TL;DR

ProAgentBench offers a rigorous, privacy-preserving benchmark for proactive AI agents operating in real-world workflows. By collecting 28,528 events over 500+ hours with long-term user context, the framework decomposes proactive assistance into When to Intervene and How to Intervene, and evaluates both LLMs and VLMs across memory-augmented and prompt-based baselines. Results show that long-term memory and authentic real-world data substantially improve both timing and content generation, with Knowledge Graph memory delivering the largest gains and real-world fine-tuning outperforming synthetic data. The work provides a practical foundation for context-aware, proactive agents that can integrate into human workflows while respecting privacy and ethical considerations.

Abstract

Proactive agents that anticipate user intentions without explicit prompts represent a significant evolution in human-AI interaction, promising to reduce cognitive load and streamline workflows. However, existing datasets suffer from two critical deficiencies: (1) reliance on LLM-synthesized data that fails to capture authentic human decision-making patterns, and (2) a focus on isolated tasks rather than continuous workflows, which misses the pre-assistance behavioral context essential for learning proactive intervention signals. To address these gaps, we introduce ProAgentBench, a rigorous benchmark for proactive agents in working scenarios. Our contributions include: (1) a hierarchical task framework that decomposes proactive assistance into timing prediction and assistance content generation; (2) a privacy-compliant dataset with 28,000+ events from 500+ hours of real user sessions, preserving bursty interaction patterns (burstiness B=0.787) absent in synthetic data; and (3) extensive experiments that evaluate LLM- and VLM-based baselines. Empirically, we show that long-term memory and historical context significantly enhance prediction accuracy, while real-world training data substantially outperforms synthetic alternatives. We release our dataset and code at https://anonymous.4open.science/r/ProAgentBench-6BC0.
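
The abstract reports a burstiness value (B=0.787) without stating the formula behind it. As a minimal sketch, assuming the widely used Goh-Barabási definition B = (sigma - mu) / (sigma + mu), where mu and sigma are the mean and standard deviation of inter-event times, the coefficient could be computed as below. The function name `burstiness` and the example timestamps are illustrative assumptions, not taken from the paper or its code release.

```python
from statistics import mean, pstdev


def burstiness(timestamps):
    """Burstiness coefficient of an event sequence.

    Assumes the definition B = (sigma - mu) / (sigma + mu) over
    inter-event times; the paper may use a different variant.
    B is -1 for perfectly regular events, near 0 for Poisson-like
    events, and approaches 1 for highly bursty events.
    """
    ordered = sorted(timestamps)
    # Inter-event times between consecutive events.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events")
    mu = mean(gaps)
    sigma = pstdev(gaps)  # population standard deviation
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


# Hypothetical event times in seconds: evenly spaced vs. clustered.
regular = [0, 1, 2, 3, 4]
clustered = [0, 1, 2, 3, 1000, 1001, 1002]

print(burstiness(regular))
print(burstiness(clustered))
```

Under this definition, evenly spaced events give the minimum value of -1, while long idle gaps between clusters of activity push the value toward 1, which is the pattern the dataset is described as preserving.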

Paper Structure

This paper contains 76 sections, 13 equations, 10 figures, and 6 tables.

Figures (10)

  • Figure 1: Illustration of Proactive Agent Workflow. The agent continuously monitors user screen activities and contextual signals. When assistance is needed, it proactively determines when to intervene and how to assist based on historical observations and user behavior patterns.
  • Figure 2: Temporal distributions and context relevance. We report total events, LLM events, and the LLM ratio across (a) weekdays and (b) hours of day, and (c) the distribution of time-to-event for the Top-1/3/5/10 nearest screenshots (log-log). Similarity is computed using qwen2.5-vl-embedding.
  • Figure 3: Statistics of Human and LLM Synthesized Data
  • Figure 4: Data Collection Pipeline Overview. The figure illustrates the end-to-end data collection process, including screenshot capture, metadata synchronization, privacy filtering, and storage workflow.
  • Figure 5: Impact of Historical Context Length. We evaluate the performance of proactive assistance across different time window sizes (from 30s to 10m). (a) F1 score on the "When to Assist" task. (b) Intention accuracy on the "How to Assist" task.
  • ...and 5 more figures