Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search

Chandan K. Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, Karthik Subbian

TL;DR

The Shopping Queries Dataset addresses the challenge of high-quality product search by providing a large-scale, multilingual benchmark built from real Amazon queries, with about 130,000 unique queries and 2.6 million manually labeled judgments across English, Spanish, and Japanese. It defines three tasks (query-product ranking, multiclass relevance classification, and substitute identification) and reports baseline results, measured with $nDCG$ and $F1$, as a reference point. The dataset combines breadth and depth with rich product metadata and human annotations, aiming to become a gold standard for evaluating and improving e-commerce search and semantic matching. Its baselines, built on neural encoders (Cross-Encoder, MPNet) and traditional IR (BM25), perform well across languages yet leave substantial room for improvement, motivating future research and better real-world deployments.

Abstract

Improving the quality of search results can significantly enhance users' experience and engagement with search engines. In spite of several recent advancements in the fields of machine learning and data mining, correctly classifying items for a particular user search query has been a long-standing challenge that still leaves considerable room for improvement. This paper introduces the "Shopping Queries Dataset", a large dataset of difficult Amazon search queries and results, publicly released with the aim of fostering research in improving the quality of search results. The dataset contains around 130 thousand unique queries and 2.6 million manually labeled (query, product) relevance judgements. The dataset is multilingual, with queries in English, Japanese, and Spanish. The Shopping Queries Dataset is being used in one of the KDDCup'22 challenges. In this paper, we describe the dataset and present three evaluation tasks along with baseline results: (i) ranking the results list, (ii) classifying product results into relevance categories, and (iii) identifying substitute products for a given query. We anticipate that this data will become the gold standard for future research on the topic of product search.
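
To make the ranking task concrete, the following is a minimal sketch of how $nDCG$ could be computed for one query's ranked result list. It assumes the four ESCI labels (Exact, Substitute, Complement, Irrelevant) are mapped to numeric gains; the gain values and the helper names `GAIN`, `dcg`, and `ndcg` are illustrative assumptions, not definitions taken from the text above.

```python
import math

# Assumed gain per ESCI label (illustrative values only).
GAIN = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}


def dcg(labels):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum(
        GAIN[label] / math.log2(rank + 1)
        for rank, label in enumerate(labels, start=1)
    )


def ndcg(ranked_labels):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_labels, key=GAIN.get, reverse=True)
    ideal_dcg = dcg(ideal)
    # A query whose results all have zero gain has no meaningful ranking.
    return dcg(ranked_labels) / ideal_dcg if ideal_dcg > 0 else 0.0


# Placing an irrelevant product above an exact match lowers the score.
print(ndcg(["E", "S", "C", "I"]))
print(ndcg(["I", "E", "S", "C"]))
```

Under the same assumptions, the substitute-identification task reduces to a binary relabeling of these judgments (label `"S"` versus everything else), which is then scored with $F1$ rather than $nDCG$.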

Paper Structure

This paper contains 14 sections, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: MLP classifier whose input is the concatenation of the representations provided by BERT multilingual base for the query and title of the product.
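
The architecture named in the Figure 1 caption can be sketched as a forward pass over two concatenated embeddings. This is only an illustration under stated assumptions: random vectors stand in for the BERT multilingual base representations (768 dimensions each), and the single hidden layer, its width, the ReLU activation, the four output classes, and the names `init_params` and `mlp_forward` are all hypothetical choices, not details confirmed by the caption.

```python
import numpy as np

EMBED_DIM = 768    # hidden size of BERT base models
HIDDEN_DIM = 128   # assumed hidden width
NUM_CLASSES = 4    # assumed: one class per ESCI label


def init_params(seed=0):
    """Randomly initialised weights for a one-hidden-layer MLP."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.02, size=(HIDDEN_DIM, 2 * EMBED_DIM)),
        "b1": np.zeros(HIDDEN_DIM),
        "W2": rng.normal(0.0, 0.02, size=(NUM_CLASSES, HIDDEN_DIM)),
        "b2": np.zeros(NUM_CLASSES),
    }


def mlp_forward(query_vec, title_vec, params):
    """Concatenate the query and title vectors and return class probabilities."""
    x = np.concatenate([query_vec, title_vec])             # shape (2 * EMBED_DIM,)
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])   # ReLU hidden layer
    logits = params["W2"] @ h + params["b2"]
    exp = np.exp(logits - logits.max())                    # numerically stable softmax
    return exp / exp.sum()


# Stand-ins for the encoder outputs of one (query, product title) pair.
rng = np.random.default_rng(1)
query_vec = rng.normal(size=EMBED_DIM)
title_vec = rng.normal(size=EMBED_DIM)

probs = mlp_forward(query_vec, title_vec, init_params())
print(probs.shape, float(probs.sum()))
```

In a real setup the two input vectors would come from the encoder and the weights would be trained on the labeled judgments; the sketch only shows how concatenation doubles the input width seen by the classifier.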