WebQuest: A Benchmark for Multimodal QA on Web Page Sequences

Maria Wang, Srinivas Sunkara, Gilles Baechler, Jason Lin, Yun Zhu, Fedir Zubach, Lei Shu, Jindong Chen

TL;DR

WebQuest introduces a multimodal benchmark for QA across sequences of web pages, addressing a gap in existing benchmarks as users increasingly rely on cross-site information retrieval. It defines three QA categories (single-screen, multi-screen, and trace QA) to evaluate information extraction, multimodal retrieval, and navigation-based reasoning, and tests both proprietary and open-source models. The study reveals a pronounced gap between single-screen and multi-screen reasoning, with Chain-of-Thought prompting providing notable gains, especially in multi-screen and trace tasks. The work highlights practical implications for developing robust web agents and suggests directions such as OCR, DOM-aware representations, and interactive multi-turn dialogues to advance cross-page UI reasoning.

Abstract

The rise of powerful multimodal LLMs has enhanced the viability of building web agents that can, with increasing levels of autonomy, help users retrieve information and complete tasks on various human-computer interfaces. It is hence necessary to build challenging benchmarks that span a wide variety of use cases reflecting real-world usage. In this work, we present WebQuest, a multi-page question-answering dataset that requires reasoning across multiple related web pages. In contrast to existing UI benchmarks that focus on multi-step web navigation and task completion, our dataset evaluates information extraction, multimodal retrieval, and composition of information from many web pages. WebQuest includes three question categories: single-screen QA, multi-screen QA, and QA based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4V, Gemini Flash, and Claude 3, and open-source models such as InstructBLIP and PaliGemma, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference-time techniques such as Chain-of-Thought prompting to improve model capabilities on multi-screen reasoning.

Paper Structure (29 sections, 19 figures, 3 tables)

This paper contains 29 sections, 19 figures, 3 tables.

Figures (19)

  • Figure 1: An example of multi-page question and answering in WebQuest. In this example, there are 4 screenshots and 2 question-answer pairs. Answering both questions needs reasoning over information extracted from the different screenshots.
  • Figure 2: An example of Single Screen QA, where the task is to count how many cloudy days precede a rainy one, and the weather conditions are depicted by pictograms.
  • Figure 3: Distribution of website categories of Single Screen QA examples.
  • Figure 4: An example of Trace QA on airfare differences across cabin classes. A browsing session is provided with all screen sequences, but not all of them are required to answer the question.
  • Figure 5: This figure demonstrates the relationships among Single Screen QA, Multi Screen QA, and Trace QA. Single Screen QA focuses on a single page, Multi Screen QA focuses on multiple pages within a browsing session, and Trace QA focuses on the entire browsing session.
  • ...and 14 more figures
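
The relationship among the three categories described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the paper's data format: the `QAExample` class, its field names, and the `category` rule below are hypothetical, and the rule assumes a category can be inferred purely from how many screens of a browsing session are shown alongside a question.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAExample:
    """Hypothetical container for one question-answer pair."""
    question: str
    answer: str
    screens: List[str]  # identifiers of the screenshots provided with the question


def category(example: QAExample, session_length: int) -> str:
    """Assumed rule: classify by how much of the session is provided."""
    n = len(example.screens)
    if n == 1:
        return "single-screen"   # one page only
    if n < session_length:
        return "multi-screen"    # several pages, but not the whole session
    return "trace"               # the entire browsing session


single = QAExample("How many cloudy days precede a rainy one?", "2", ["s1"])
multi = QAExample("Which listing is cheaper?", "the second", ["s1", "s3"])
trace = QAExample("How do fares differ by cabin class?", "see trace", ["s1", "s2", "s3", "s4"])

for ex in (single, multi, trace):
    print(category(ex, session_length=4))
```

In a trace example the model receives every screen of the session even though, as the Figure 4 caption notes, not all of them are needed to answer the question.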