ORBIT -- Open Recommendation Benchmark for Reproducible Research with Hidden Tests

Jingyuan He, Jiongnan Liu, Vishan Vishesh Oberoi, Bolin Wu, Mahima Jagadeesh Patel, Kangrui Mao, Chuning Shi, I-Ta Lee, Arnold Overwijk, Chenyan Xiong

TL;DR

ORBIT tackles the reproducibility and realism gaps in recommender benchmarks by standardizing evaluation across five public datasets and introducing ClueWeb-Reco, a privacy-preserving hidden test for large-scale webpage recommendation. It couples a reproducible public leaderboard with a hidden test built via a privacy-conscious soft matching pipeline that maps real user browsing histories to public ClueWeb pages, enabling realistic yet synthetic next-item evaluation. Across 12 benchmarked models, content-based and LLM-enhanced approaches show clear benefits on public data, while the ClueWeb-Reco hidden test reveals strong potential for LLM-driven query generation to handle vast candidate pools, albeit with model-dependent trade-offs. The ORBIT framework, datasets, and codebase aim to foster transparent, comparable, and privacy-aware advancement in recommender-system research, with planned expansions to broaden coverage and model types.
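The soft-matching idea can be illustrated with a small sketch. It assumes that each privately collected page and each public corpus page already has an embedding vector, and that a collected page is mapped to the public page with the highest cosine similarity. The page IDs, vectors, and the names `cosine` and `soft_match` are hypothetical and chosen for illustration only; the summary above does not specify the embedding model or retrieval system the authors use, so this is not their implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_match(session, corpus):
    """Map each private page embedding to (public page ID, retrieval score).

    session: list of embedding vectors for privately collected pages, in browsing order.
    corpus:  dict of hypothetical public page ID -> embedding vector.
    """
    mapped = []
    for page_vec in session:
        # Exhaustive nearest-neighbor search; a real corpus would need an ANN index.
        best_id = max(corpus, key=lambda pid: cosine(page_vec, corpus[pid]))
        mapped.append((best_id, cosine(page_vec, corpus[best_id])))
    return mapped


# Toy public corpus and one toy browsing session (all values are made up).
corpus = {
    "cw-a": [1.0, 0.0],
    "cw-b": [0.0, 1.0],
    "cw-c": [1.0, 1.0],
}
session = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.7]]

mapped = soft_match(session, corpus)

# Next-item setup: the last mapped page is the target, earlier ones form the history.
history = [pid for pid, _ in mapped[:-1]]
target = mapped[-1][0]
print(history, target)
```

Only the mapped public page IDs leave the function, so the released sequence refers to public pages rather than to the pages a subject actually visited. The retrieval score returned alongside each ID is the kind of quantity one could threshold or inspect to judge mapping quality.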

Abstract

Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This paper introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models' generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. ORBIT benchmark, leaderboard, and codebase are available at https://www.open-reco-bench.ai.

Paper Structure

This paper contains 38 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: An illustration of the collection and processing pipeline of ClueWeb-Reco. Subject inputs that pass the two quality control checks are stored and mapped to ClueWeb22 pages through a soft-matching pipeline on the right.
  • Figure 2: The subfigures illustrate, respectively: the distribution of the embedding retrieval scores between collected webpages and retrieved webpages; the average human-annotated relevance label (1-5) for each quantile of the ascending retrieval scores; the distribution of annotated relevance labels for mappings created from different retrieval candidates; and the distribution of the number of interactions per session in ClueWeb-Reco.
  • Figure 3: Top domain distribution before and after soft-matching. The three subfigures illustrate the distribution of the top-10 domains of the raw collected webpages, randomly sampled ClueWeb22-B EN webpages, and the mapped webpages in ClueWeb-Reco after the soft-matching process, respectively.
  • Figure 4: Demographic distribution of the raw collected dataset for ClueWeb-Reco.
  • Figure 5: The two interfaces through which subjects submit their browsing data. The Edge Browser Export contains detailed instructions on how to properly export the browsing history file from Edge.
  • ...and 2 more figures