MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, Harsha Vardhan Simhadri, Manik Varma, Yujing Wang, Linjun Yang, Mao Yang, Ce Zhang
TL;DR
MS MARCO Web Search is a large-scale, information-rich web dataset built on ClueWeb22, with millions of real clicked query-document labels across 93 languages. It provides two document scales (100M and 10B) and training and test query sets derived from Bing logs, with a time-based split to stress generalization. The paper defines three benchmark tasks (embedding-model evaluation, ANN-based retrieval, and end-to-end retrieval systems) and reports baseline results showing that end-to-end performance depends on how embedding models, ANN indices, and traditional rankers are integrated; SimANS performs best among the embedding models under brute-force search. By combining real click labels, multilingual coverage, and web-scale skew, MS MARCO Web Search offers a foundational resource for advancing neural indexing, embedding methods, and LLM-enabled information access toward robust, scalable retrieval in practice.
Abstract
Recent breakthroughs in large models have highlighted the critical significance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics the real-world web document and query distribution, provides rich information for various kinds of downstream tasks, and encourages research in various areas, such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both the machine learning and information retrieval system research domains. As the first dataset that meets the requirements of large, real, and rich data, MS MARCO Web Search paves the way for future advancements in AI and system research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
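
To make the embedding-model evaluation task concrete, the following is a minimal sketch of how an embedding model could be scored against clicked query-document labels using brute-force search. It is not the benchmark's official evaluation code. The metrics (MRR@k and recall@k), the function names, and the toy vectors are all assumptions chosen for illustration; this section does not specify them.

```python
from typing import Dict, List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query_vec: Sequence[float],
                      doc_vecs: Dict[str, Sequence[float]],
                      k: int) -> List[str]:
    """Score every document against the query and return the k best ids."""
    ranked = sorted(doc_vecs, key=lambda d: dot(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]


def mrr_at_k(ranked: List[str], clicked: Set[str], k: int) -> float:
    """Reciprocal rank of the first clicked document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in clicked:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], clicked: Set[str], k: int) -> float:
    """Fraction of clicked documents that appear within the top k."""
    if not clicked:
        return 0.0
    return len(set(ranked[:k]) & clicked) / len(clicked)


def evaluate(query_vecs: Dict[str, Sequence[float]],
             doc_vecs: Dict[str, Sequence[float]],
             clicks: Dict[str, Set[str]],
             k: int) -> Dict[str, float]:
    """Average MRR@k and recall@k over all labeled queries."""
    mrr_total, recall_total = 0.0, 0.0
    for qid, qvec in query_vecs.items():
        ranked = brute_force_top_k(qvec, doc_vecs, k)
        mrr_total += mrr_at_k(ranked, clicks[qid], k)
        recall_total += recall_at_k(ranked, clicks[qid], k)
    n = len(query_vecs)
    return {"mrr": mrr_total / n, "recall": recall_total / n}


# Hypothetical toy embeddings and click labels (not taken from the dataset).
DOC_VECS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
QUERY_VECS = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
CLICKS = {"q1": {"d1"}, "q2": {"d3"}}

if __name__ == "__main__":
    print(evaluate(QUERY_VECS, DOC_VECS, CLICKS, k=2))
```

Brute-force scoring isolates embedding quality from index approximation error, which is why it suits the embedding-model task. An ANN-based evaluation would swap `brute_force_top_k` for an approximate index lookup and keep the same metrics.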
