SplitLight: An Exploratory Toolkit for Recommender Systems Datasets and Splits

Anna Volodkevich, Dmitry Anikin, Danil Gusak, Anton Klenitskiy, Evgeny Frolov, Alexey Vasilev

TL;DR

SplitLight is an open-source exploratory toolkit that helps researchers and practitioners who design preprocessing and splitting pipelines, or who review external artifacts, make data-preparation decisions measurable, comparable, and reportable in recommender systems research and industry.

Abstract

Offline evaluation of recommender systems is often affected by hidden, under-documented choices in data preparation. Seemingly minor decisions in filtering, handling repeats, cold-start treatment, and splitting strategy design can substantially reorder model rankings and undermine reproducibility and cross-paper comparability. In this paper, we introduce SplitLight, an open-source exploratory toolkit that enables researchers and practitioners designing preprocessing and splitting pipelines or reviewing external artifacts to make these decisions measurable, comparable, and reportable. Given an interaction log and derived split subsets, SplitLight analyzes core and temporal dataset statistics, characterizes repeat consumption patterns and timestamp anomalies, and diagnoses split validity, including temporal leakage, cold-user/item exposure, and distribution shifts. SplitLight further allows side-by-side comparison of alternative splitting strategies through comprehensive aggregated summaries and interactive visualizations. Delivered as both a Python toolkit and an interactive no-code interface, SplitLight produces audit summaries that justify evaluation protocols and support transparent, reliable, and comparable experimentation in recommender systems research and industry.

Paper Structure (25 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Global temporal split structure for (a) last-item and (b) all-items target definitions. Evaluation inputs include all within-user history before evaluation targets.
  • Figure 2: Analysis of cold items in test target subset for Diginetica dataset. Share of interactions with cold items grows in time from nearly 20% to 40%, resulting in 34.35% overall.
  • Figure 3: Temporal properties of the data in case of the leave-one-out split: dataset timeframe, train/evaluation durations and overlap, and per-user lifetime coverage.
  • Figure 4: Repeat consumption illustration: (a) repeated interactions (after first occurrence of an item) and (b) consecutive repeats (same item in immediate succession).
  • Figure 5: Temporal skew in ML-1M interactions. Test target subset spans nearly 76% of dataset timeframe under GTS $q_{0.9}$.
  • ...and 2 more figures
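
The quantities named in the captions above (a global temporal split at a timestamp quantile, the share of test interactions with cold items, and the two kinds of repeat consumption) can be illustrated with a short pandas sketch. This is not SplitLight's API: the function names, the column names (`user_id`, `item_id`, `timestamp`), and the toy log are hypothetical, and the definitions used here (a cold item is one that never appears in the training subset; a consecutive repeat is the same item as the user's immediately preceding interaction) are assumptions made for illustration.

```python
import pandas as pd


def global_temporal_split(log, q=0.9, time_col="timestamp"):
    """Split at the q-th timestamp quantile: rows at or before it go to train."""
    threshold = log[time_col].quantile(q)
    train = log[log[time_col] <= threshold]
    test = log[log[time_col] > threshold]
    return train, test, threshold


def cold_item_share(train, test, item_col="item_id"):
    """Fraction of test interactions whose item never occurs in train."""
    known_items = set(train[item_col])
    return float((~test[item_col].isin(known_items)).mean())


def repeat_shares(log, user_col="user_id", item_col="item_id",
                  time_col="timestamp"):
    """Return (share of repeated interactions, share of consecutive repeats)."""
    ordered = log.sort_values([user_col, time_col], kind="stable")
    # Repeated: any occurrence of a (user, item) pair after the first one.
    repeated = ordered.duplicated([user_col, item_col], keep="first")
    # Consecutive repeat: same item as the user's previous interaction.
    previous_item = ordered.groupby(user_col)[item_col].shift(1)
    consecutive = ordered[item_col] == previous_item
    return float(repeated.mean()), float(consecutive.mean())


# Hypothetical toy interaction log (integer ids, integer timestamps).
log = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "item_id":   [10, 10, 20, 10, 20, 30, 30, 40, 10, 50],
    "timestamp": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

train, test, threshold = global_temporal_split(log, q=0.8)
cold_share = cold_item_share(train, test)
repeated_share, consecutive_share = repeat_shares(log)

# A basic leakage check: no training interaction is later than any test one.
no_temporal_leakage = train["timestamp"].max() <= test["timestamp"].min()

print(threshold, cold_share, repeated_share, consecutive_share,
      no_temporal_leakage)
```

Computing these shares on the training and test subsets side by side is one simple way to compare candidate splits before committing to an evaluation protocol.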