RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu

TL;DR

RealWebAssist introduces the first benchmark for evaluating long-horizon web assistance with real users, capturing sequential instructions that evolve over time across multiple websites. It combines open-ended user goals with GUI-grounded actions and offline evaluation using annotated click regions, providing a realistic, challenging testbed for grounding, reasoning, and planning in web contexts. Experimental results show that state-of-the-art GUI grounding models, VLMs, and LRMs struggle to follow real-world, multi-step instructions and fall well short of human performance. Combining reasoning models with grounding models helps but does not close the gap, and fine-tuning on real data offers limited gains. The benchmark highlights four key challenges (spatial reasoning, temporal reasoning, multi-step planning, and learning user-specific routines), pointing future research toward more capable long-horizon web agents and richer real-user datasets.

Abstract

To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.

Paper Structure

This paper contains 26 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: An example sequential instruction following task with a real-world user. The red circles indicate the correct actions based on the user's spoken instructions. Sequential instructions introduce unique challenges, such as the need to retain and reason over past context. For instance, the instruction in step 3 requires information from step 1 to be correctly interpreted.
  • Figure 2: Examples of general task categories (left) and websites visited (right) in RealWebAssist. The tasks span a wide range of real-world scenarios, from shopping to food & entertainment to travel planning, which encourages users to visit many different websites.
  • Figure 3: Multiple actions can satisfy a user’s intent. A web agent's action is considered correct if the coordinate it provides falls within one of the annotated correct regions.
  • Figure 4: Key challenges introduced by RealWebAssist: (A) spatial reasoning, (B) temporal reasoning, (C) multi-step planning, and (D) learning user-specific routines.
  • Figure 5: Qualitative results. The captions show instructions generated by o3 (the best LRM). (A) Error corrected by using o3 to convert instructions. (B) Failure caused by GTA-1 when o3 reasons correctly. (C) Reasoning failure caused by o3.
  • ...and 4 more figures
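
The correctness criterion described in the Figure 3 caption can be sketched as a point-in-region test. This is a minimal sketch, assuming each annotated region is an axis-aligned rectangle in pixel coordinates with inclusive borders; the `Region` type and the function names `click_is_correct` and `step_accuracy` are hypothetical and are not taken from the benchmark's own code.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Region(NamedTuple):
    """Hypothetical annotated click region: an axis-aligned box in pixels."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def click_is_correct(x: float, y: float, regions: Sequence[Region]) -> bool:
    """Return True if (x, y) lies inside at least one annotated region.

    Borders are treated as inside the region (an assumption of this sketch).
    """
    return any(
        r.x_min <= x <= r.x_max and r.y_min <= y <= r.y_max
        for r in regions
    )


def step_accuracy(
    predictions: Sequence[Tuple[float, float]],
    annotations: Sequence[Sequence[Region]],
) -> float:
    """Fraction of steps whose predicted click hits any correct region."""
    if len(predictions) != len(annotations):
        raise ValueError("need exactly one set of regions per predicted click")
    if not predictions:
        return 0.0
    hits = sum(
        click_is_correct(x, y, regions)
        for (x, y), regions in zip(predictions, annotations)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    # Two alternative correct regions for the same step (made-up coordinates).
    regions: List[Region] = [
        Region(10, 10, 50, 30),
        Region(100, 200, 140, 220),
    ]
    print(click_is_correct(20, 20, regions))
    print(step_accuracy([(20, 20), (60, 20)], [regions, regions]))
```

Accepting a hit in any one of several regions reflects the point in the caption that multiple actions can satisfy the same intent, so a step is not tied to a single target element.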