MerRec: A Large-scale Multipurpose Mercari Dataset for Consumer-to-Consumer Recommendation Systems

Lichi Li; Zainul Abi Din; Zhen Tan; Sam London; Tianlong Chen; Ajay Daptardar

TL;DR

MerRec tackles the gap in C2C recommender research by introducing a large-scale Mercari-derived dataset with rich item and interaction features across six months in 2023, designed to support diverse research tasks. The paper presents Mercatran, a three-tower transformer model that encodes users with long histories and items via content features, enabling multi-step recommendations and retrieval in a vector database. Through experiments on CTR, SBR, MTL, and IAR, the paper demonstrates both the dataset’s complexity and its value as a benchmark, and it discusses production-ready deployment considerations and code/dataset availability. Overall, MerRec advances realistic C2C recommender research by providing a scalable, attribute-rich resource and a tailored modeling approach that addresses the SKU-less, dual-role user dynamics of real-world marketplaces, bridging academia and industry.

Abstract

In the evolving e-commerce field, recommendation systems crucially shape user experience and engagement. The rise of Consumer-to-Consumer (C2C) recommendation systems, noted for their flexibility and ease of access for customer vendors, marks a significant trend. However, the academic focus remains largely on Business-to-Consumer (B2C) models, leaving a gap filled by the limited C2C recommendation datasets that lack item attributes, user diversity, and scale. The intricacy of C2C recommendation systems is further accentuated by the dual roles users assume as both sellers and buyers, introducing a spectrum of less uniform and varied inputs. Addressing this, we introduce MerRec, the first large-scale dataset specifically for C2C recommendations, sourced from the Mercari e-commerce platform, covering millions of users and products over 6 months in 2023. MerRec not only includes standard features such as user_id, item_id, and session_id, but also unique elements like timestamped action types, product taxonomy, and textual product attributes, offering a comprehensive dataset for research. This dataset, extensively evaluated across four recommendation tasks, establishes a new benchmark for the development of advanced recommendation algorithms in real-world scenarios, bridging the gap between academia and industry and propelling the study of C2C recommendations. Our experiment code is available at https://github.com/mercari/mercari-ml-merrec-pub-us and dataset at https://huggingface.co/datasets/mercari-us/merrec.
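
To make the interaction-log structure described in the abstract concrete, the sketch below groups event rows by user and session and orders each group by time, which is the kind of sequence a session-based recommendation task might consume. It is only an illustration under stated assumptions: the field names `action` and `timestamp`, the action labels, and the toy rows are hypothetical and are not taken from the MerRec schema; only `user_id`, `item_id`, and `session_id` are named in the abstract.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    user_id: int
    session_id: int
    item_id: int
    action: str      # hypothetical action-type label
    timestamp: int   # hypothetical Unix timestamp in seconds


def build_session_sequences(events):
    """Group events by (user_id, session_id) and order each group by time."""
    sessions = defaultdict(list)
    for ev in events:
        sessions[(ev.user_id, ev.session_id)].append(ev)
    return {
        key: [(ev.item_id, ev.action)
              for ev in sorted(evs, key=lambda e: e.timestamp)]
        for key, evs in sessions.items()
    }


# Toy rows invented for illustration only (not MerRec data).
toy_events = [
    Event(1, 10, 501, "item_view", 1700000020),
    Event(1, 10, 500, "item_view", 1700000000),
    Event(1, 10, 501, "item_like", 1700000050),
    Event(2, 11, 777, "item_view", 1700000010),
]
sequences = build_session_sequences(toy_events)
```

Keying on the (user_id, session_id) pair rather than session_id alone is a defensive choice in case session identifiers are not globally unique; whether that holds for the real dataset is not stated here.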

Paper Structure (38 sections, 14 figures, 12 tables)

Introduction
Related Work
E-commerce Recommendation Systems
Datasets for Recommendation Systems
The Dataset
Dataset Requirement
Data Cleaning and Processing
Comparison to Other E-commerce Datasets
Experiment & Analysis
Click-Through Rate (CTR) Prediction
Task Description.
Dataset Setup.
Baseline Setup.
Implications.
Session-based Recommendation (SBR)
...and 23 more sections

Figures (14)

Figure 1: Comparison of B2C and C2C E-commerce Platforms: This illustration highlights the differences in product descriptions between B2C and C2C platforms. B2C platforms typically feature consistent and professionally crafted merchandise descriptions. Conversely, product descriptions on C2C platforms are often more varied and less standardized, posing challenges to the robustness of recommendations.
Figure 2: Breakdown of the C0-level category appearances over the distinct items. The MerRec dataset has some concentration in Women and Toys & Collectibles, but is overall reasonably balanced and represents a broad spectrum of items available on the Mercari marketplace.
Figure 3: Breakdown of the top 50 C1-level category appearances over the distinct items. The stacked bars represent categories that share a name but belong to different C0-level categories. For example, there are two distinct C1 category IDs called Shoes, one from the C0 Womens and another from the C0 Mens.
Figure 4: Breakdown of item condition appearances over the distinct items.
Figure 5: Word cloud of the most frequently observed words within MerRec's item titles. Stop words are omitted.
...and 9 more figures
