DeepShop: A Benchmark for Deep Research Shopping Agents

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, Xiuying Chen

TL;DR

DeepShop introduces a realistic benchmark for evaluating web shopping agents under complex, multi-attribute queries across five product domains. It combines query diversity and complexity evolution with a two-stage, fine-grained and holistic evaluation framework, revealing substantial gaps for simple RAG, current web agents, and even deep-research systems. The results highlight grounding, replanning, and action-space limitations in agents and show that even advanced systems struggle with filters and sorting under realistic constraints. This benchmark advances the field by providing a principled testbed to drive development of robust, adaptable deep research shopping agents with better generalization and reliability.

Abstract

Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. However, existing benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels (easy, medium, and hard) based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance on fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
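To make the relationship between the fine-grained and holistic metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each fine-grained aspect is judged as pass/fail (or marked as not required by the query), that holistic success means every required aspect passes, and that complexity levels are cut at hypothetical evolution counts. The names `AspectJudgement`, `holistic_success`, `complexity_level`, and `success_rate`, as well as the thresholds, are illustrative assumptions not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AspectJudgement:
    """Pass/fail verdicts for one agent trajectory (hypothetical structure).

    None means the query places no constraint on that aspect.
    """
    attributes: Optional[bool] = None
    filters: Optional[bool] = None
    sorting: Optional[bool] = None


def holistic_success(j: AspectJudgement) -> bool:
    # Assumption: a task succeeds only if every aspect the query
    # actually requires was satisfied. With no required aspects this
    # is vacuously True.
    required = [v for v in (j.attributes, j.filters, j.sorting) if v is not None]
    return all(required)


def complexity_level(num_evolutions: int, medium_from: int = 2, hard_from: int = 4) -> str:
    # Assumption: the cut-offs are illustrative; the abstract only says
    # the level depends on the number of evolutions.
    if num_evolutions >= hard_from:
        return "hard"
    if num_evolutions >= medium_from:
        return "medium"
    return "easy"


def success_rate(judgements: List[AspectJudgement]) -> float:
    # Fraction of trajectories that are holistic successes.
    if not judgements:
        return 0.0
    return sum(holistic_success(j) for j in judgements) / len(judgements)


if __name__ == "__main__":
    runs = [
        AspectJudgement(attributes=True, filters=True),                # sorting not required
        AspectJudgement(attributes=True, filters=False, sorting=True), # filter missed
    ]
    print(success_rate(runs), complexity_level(3))
```

Under these assumptions, an agent that satisfies product attributes but misses a single filter or sorting constraint still counts as a failure, which is consistent with the abstract's observation that weak filter and sorting handling leads to low overall success rates.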

Paper Structure

This paper contains 29 sections, 2 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: DeepShop evaluates agents on realistic and complex shopping queries with fine-grained, holistic metrics, while existing benchmarks use overly simple queries lacking contextual depth.
  • Figure 2: Running examples of diversity and complexity evolution in DeepShop. Complexity evolution includes attribute evolution, filter evolution, and sorting evolution.
  • Figure 3: Product category distribution after query diversity evolution.
  • Figure 4: Analysis of query complexity evolution.
  • Figure 5: Detailed analysis of performance across different product categories and query complexity.
  • ...and 6 more figures