VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Lawrence Jang, Yinheng Li, Dan Zhao, Charles Ding, Justin Lin, Paul Pu Liang, Rogerio Bonatti, Kazuhito Koishida

TL;DR

VideoWebArena presents a large, open-source benchmark to evaluate long-context multimodal agents on video understanding within web-based tasks. It introduces a two-pronged task taxonomy (skill retention vs factual retention), 2,021 tasks, and 74 original videos across six domains, all in a reproducible, interactive environment. Baseline results with video-enabled models reveal substantial gaps to human performance, especially in factual retrieval and planning under long context, highlighting challenges in memory, grounding, and action grounding. The work provides a rigorous testbed for advancing long-context multimodal agents and offers detailed analysis of failure modes to guide future research.

Abstract

Videos are often used to learn or extract the information needed to complete tasks, in ways that text and static imagery alone cannot provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. Skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, while factual retention tasks evaluate whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease on WebArena tasks and a 10.3% decrease on VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
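
To make the reported gap to human performance concrete, the short sketch below recomputes it from the four percentages quoted in the abstract. The dictionary layout and the helper name `gap_to_human` are illustrative assumptions, not part of the benchmark's code.

```python
from typing import Tuple

# Success rates (percent) quoted in the abstract.
# The dictionary layout is an illustrative assumption.
RESULTS = {
    "factual retention tasks": {"best_model": 13.3, "human": 73.9},
    "factual retention QA pairs": {"best_model": 45.8, "human": 79.3},
}


def gap_to_human(best_model: float, human: float) -> Tuple[float, float]:
    """Return (absolute gap in percentage points, model score / human score)."""
    absolute_gap = round(human - best_model, 1)
    relative_score = round(best_model / human, 3)
    return absolute_gap, relative_score


if __name__ == "__main__":
    for name, scores in RESULTS.items():
        gap, ratio = gap_to_human(scores["best_model"], scores["human"])
        print(f"{name}: gap = {gap} points, model/human = {ratio}")
```

Reporting both the absolute gap and the ratio is a presentation choice: the absolute gap shows how far the best model is from human performance, while the ratio shows what fraction of human performance it reaches.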

Paper Structure

This paper contains 47 sections, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Overview of VideoWebArena. VideoWebArena is a visually grounded benchmark that tests the video understanding of agentic models across various realistic domains and environments, mirroring real-life tasks. All tasks require video input and consist of Q/A to test agentic abilities in video information retrieval, video understanding, and more.
  • Figure 2: Left: VideoWebArena Video Difficulty Task Distribution. Right: VideoWebArena Agent Difficulty Task Distribution.
  • Figure 4: VideoWebArena Baseline Agent Framework. We use three baseline agents: (1) the Video Summary Agent, where a summary of the video is fed in-context; (2) the Video Frame Agent, where a set number of frames and the audio transcription are fed in-context; and (3) the Video Agent, where the video is fed in-context as a .mov file. The video information is placed in-context along with the Set-of-Marks (SoM) state representation to generate a single action, following the multimodal SoM agent in VisualWebArena (Koh et al., 2024).
  • Figure 5: Dataset Creation Process. A walkthrough of the VideoWebArena dataset creation. Starting from 1641 existing tasks in WebArena and VisualWebArena, the authors grouped the tasks by intent template. For each intent template, they created a new video tutorial showing how to perform the tasks. For each video, they wrote at least 4 factual retention tasks. This led to 1641 skill retention and 400 factual retention tasks.
  • Figure 6: VideoWebArena Task Taxonomy. We define a taxonomy for all the tasks in our benchmark, splitting them into factual retention and skill retention groups. Under the factual retention group, there are 4 types of tasks: Visual Perception, Audio Perception, Full Video Understanding, and Temporal Reasoning.
  • ...and 1 more figure
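
The task taxonomy described in the Figure 6 caption can be sketched as a small data structure. This is a minimal illustration, not the benchmark's actual schema: the class names `RetentionGroup`, `FactualType`, and `TaskLabel` are hypothetical, and the rule that skill retention tasks carry no subtype is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RetentionGroup(Enum):
    """The two top-level groups named in the taxonomy."""
    SKILL = "skill retention"
    FACTUAL = "factual retention"


class FactualType(Enum):
    """The four factual retention task types listed in the Figure 6 caption."""
    VISUAL_PERCEPTION = "Visual Perception"
    AUDIO_PERCEPTION = "Audio Perception"
    FULL_VIDEO_UNDERSTANDING = "Full Video Understanding"
    TEMPORAL_REASONING = "Temporal Reasoning"


@dataclass(frozen=True)
class TaskLabel:
    """Hypothetical label attached to a single task."""
    group: RetentionGroup
    factual_type: Optional[FactualType] = None

    def __post_init__(self) -> None:
        # Factual retention tasks must name one of the four types.
        if self.group is RetentionGroup.FACTUAL and self.factual_type is None:
            raise ValueError("factual retention tasks need a factual_type")
        # Assumption for this sketch: skill retention tasks have no subtype.
        if self.group is RetentionGroup.SKILL and self.factual_type is not None:
            raise ValueError("skill retention tasks take no factual_type")


if __name__ == "__main__":
    label = TaskLabel(RetentionGroup.FACTUAL, FactualType.TEMPORAL_REASONING)
    print(label.group.value, "/", label.factual_type.value)
```

Validating the label at construction time keeps inconsistent combinations, such as a factual retention task with no type, from entering downstream analysis.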