WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
Maria Wang, Srinivas Sunkara, Gilles Baechler, Jason Lin, Yun Zhu, Fedir Zubach, Lei Shu, Jindong Chen
TL;DR
WebQuest is a multimodal benchmark for question answering over sequences of web pages, motivated by users who increasingly need to gather information across multiple sites. It defines three QA categories (single-screen, multi-screen, and trace QA) that test information extraction, multimodal retrieval, and navigation-based reasoning, and it evaluates both proprietary and open-source models on them. The study finds a pronounced gap between single-screen and multi-screen reasoning, with Chain-of-Thought prompting providing notable gains, especially on multi-screen and trace tasks. The work has practical implications for building robust web agents and suggests directions such as OCR, DOM-aware representations, and interactive multi-turn dialogues for advancing cross-page UI reasoning.
Abstract
The rise of powerful multimodal LLMs has made it increasingly viable to build web agents that can, with growing autonomy, assist users in retrieving information and completing tasks on various human-computer interfaces. It is therefore necessary to build challenging benchmarks that span a wide variety of use cases reflecting real-world usage. In this work, we present WebQuest, a multi-page question-answering dataset that requires reasoning across multiple related web pages. In contrast to existing UI benchmarks that focus on multi-step web navigation and task completion, our dataset evaluates information extraction, multimodal retrieval, and composition of information from many web pages. WebQuest includes three question categories: single-screen QA, multi-screen QA, and QA based on navigation traces. We evaluate leading proprietary multimodal models such as GPT-4V, Gemini Flash, and Claude 3, as well as open-source models such as InstructBLIP and PaliGemma, on our dataset, revealing a significant gap between single-screen and multi-screen reasoning. Finally, we investigate inference-time techniques such as Chain-of-Thought prompting to improve model capabilities on multi-screen reasoning.
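To make the three question categories and the Chain-of-Thought setting concrete, here is a minimal sketch of how one example might be represented and turned into a text prompt. All names here (`QACategory`, `WebQAExample`, `build_prompt`), the screenshot file names, and the prompt wording are hypothetical illustrations. They are not the dataset's actual schema or the prompts used in the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class QACategory(Enum):
    """The three question categories named in the abstract."""
    SINGLE_SCREEN = "single-screen"
    MULTI_SCREEN = "multi-screen"
    TRACE = "trace"


@dataclass
class WebQAExample:
    """Hypothetical container for one question (not the real schema)."""
    question: str
    screenshots: List[str]  # assumed: paths to page screenshots, in order
    category: QACategory


def build_prompt(example: WebQAExample, chain_of_thought: bool = False) -> str:
    """Build the text part of a prompt; the images would be sent separately."""
    lines = [
        f"You are given {len(example.screenshots)} web page screenshot(s) "
        f"for a {example.category.value} question."
    ]
    if chain_of_thought:
        # Illustrative wording only: ask for intermediate reasoning first.
        lines.append(
            "Think step by step: note the relevant facts on each screen, "
            "combine them, then give the final answer after 'Answer:'."
        )
    else:
        lines.append("Answer the question concisely.")
    lines.append(f"Question: {example.question}")
    return "\n".join(lines)


example = WebQAExample(
    question="Which of the two listed laptops is cheaper?",
    screenshots=["page_1.png", "page_2.png"],
    category=QACategory.MULTI_SCREEN,
)
prompt = build_prompt(example, chain_of_thought=True)
```

Keeping the category on the example would let an evaluation script report accuracy per category, which is the kind of breakdown needed to compare single-screen and multi-screen performance.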
