
A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems

Hung Vinh Tran, Tong Chen, Quoc Viet Hung Nguyen, Zi Huang, Lizhen Cui, Hongzhi Yin

TL;DR

This paper addresses the lack of standardized benchmarks for lightweight embedding-based recommender systems (LERSs) by conducting a comprehensive cross-task evaluation spanning collaborative filtering (CF) and content-based (CB) recommendation, using two CF and two CB datasets and examining three compression goals. It benchmarks a diverse set of LERS approaches (compositional, pruning, NAS-based, and hybrids) alongside a simple magnitude-pruning baseline (MagPrune), with hyperparameters tuned via the Tree-structured Parzen Estimator and evaluation on both GPU and edge devices. Key findings show that performance gains depend on task and dataset, that simple pruning methods (e.g., PEP, MagPrune) can rival more complex ones, and that cross-task transferability varies across methods. The work provides practical guidance for model selection, highlights real-world efficiency trade-offs, and releases open-source code to facilitate reproducibility and future research on LERSs for edge-enabled recommendation.

Abstract

Since the creation of the Web, recommender systems (RSs) have been an indispensable mechanism in information filtering. State-of-the-art RSs primarily depend on categorical features, which are encoded by embedding vectors, resulting in excessively large embedding tables. To prevent over-parameterized embedding tables from harming scalability, both academia and industry have seen increasing efforts in compressing RS embeddings. However, despite the prosperity of lightweight embedding-based RSs (LERSs), their evaluation protocols vary widely, making it difficult to relate LERS performance to real-world usability. Moreover, despite the common goal of lightweight embeddings, LERSs are typically evaluated on only one of the two main recommendation tasks -- collaborative filtering and content-based recommendation. This lack of discussion on cross-task transferability hinders the development of unified, more scalable solutions. Motivated by these issues, this study investigates the performance, efficiency, and cross-task transferability of various LERSs via a thorough benchmarking process. Additionally, we propose an efficient embedding compression method based on magnitude pruning, an easy-to-deploy yet highly competitive baseline that outperforms various complex LERSs. Our study reveals the distinct performance of LERSs across the two tasks, shedding light on their effectiveness and generalizability. To support edge-based recommendation, we test all LERSs on a Raspberry Pi 4, where the efficiency bottleneck is exposed. Finally, we conclude with critical summaries of LERS performance, model selection suggestions, and underexplored challenges around LERSs for future research. We publish source code and artifacts at https://github.com/chenxing1999/recsys-benchmark to encourage future work.
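The magnitude-pruning baseline mentioned above boils down to zeroing out the smallest-magnitude entries of the embedding table. The PyTorch sketch below is a minimal illustration of that idea under our own assumptions (a global threshold, a `sparsity` ratio, and a CSR conversion for inference), not the paper's exact MagPrune implementation:

```python
import torch

def magnitude_prune(table: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of entries with smallest magnitude."""
    num_prune = int(table.numel() * sparsity)
    # Global threshold: the num_prune-th smallest absolute value in the table.
    threshold = table.abs().flatten().kthvalue(num_prune).values
    return table * (table.abs() > threshold)

# Toy example: a 1000-feature x 16-dim embedding table kept at 10% density.
table = torch.randn(1000, 16)
pruned = magnitude_prune(table, sparsity=0.9)
# For inference, the sparse table can be stored in CSR format, as in the
# "*" variants of the resource-usage figures listed below.
csr = pruned.to_sparse_csr()
```

Storing the pruned table in CSR keeps only the nonzero values plus compact index arrays, which is where the storage savings at inference time come from.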


Paper Structure

This paper contains 44 sections, 25 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of the main archetypes of LERS embeddings. (a) The original embedding table uses a dedicated vector $\mathbf{e}_i$ from table $\mathbf{E}$ to represent feature $i$. (b) The compositional embedding approach employs a smaller embedding table $\mathbf{E}^{meta}$ and a hash function that hashes $i$ into $(i_1, i_2)$. The final embedding $\mathbf{e}_i$ is constructed by combining $\mathbf{e}^{meta}_{i_1}$ and $\mathbf{e}^{meta}_{i_2}$ (a minimal code sketch of this archetype follows the figure list). (c) Pruning reduces the size of the original embedding table by zeroing out parts of it (shown in gray), resulting in a sparse embedding table $\mathbf{\hat{E}}$ with a smaller storage footprint. (d) Methods based on neural architecture search (NAS) commonly search for an optimal embedding dimension configuration $S$ for the embedding table from a given search space, optimizing the trade-off between memory cost and model performance.
  • Figure 2: Log frequency of features and their respective mean magnitude in the learned embeddings. Dataset details are provided in the experimental settings section. We visualize all features in the first and second feature fields of Criteo. For Gowalla, we limit the number of users/items to 200 for better visibility.
  • Figure 3: Training and inference resource usage on the Criteo dataset. The asterisk "*" in inference means the weight matrix is stored in SparseCSR format. $\ddagger$ means hash codes are created on-the-fly for DHE. TTRec implements a custom CUDA kernel for training, which naturally cannot run on the CPU. TTRec's cache is assumed to be 10% of the original model. "Mem" is shorthand for memory, "Packages" refers to Python packages and overhead in general, and "Metadata" refers to the CPU memory required to store the mapping from features (as string data type) to the corresponding feature IDs (as integer data type).
  • Figure 4: Training and inference resource usage on the Yelp2018 dataset. The asterisk "*" in inference means the weight matrix is stored in SparseCSR format. TTRec implements a custom CUDA kernel for training, which naturally cannot run on the CPU. TTRec's cache is assumed to be 10% of the original model. "Mem" is shorthand for memory, "Packages" refers to Python packages and overhead in general, and "Metadata" refers to the CPU memory required to store the mapping from features (as string data type) to the corresponding feature IDs (as integer data type).
  • Figure 5: Performance of MagPrune with different $n\_min$. For Yelp2018 and Avazu, the backbones are LightGCN and DCN, respectively.
  • ...and 2 more figures
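To make the compositional archetype in Figure 1(b) concrete, here is a minimal PyTorch sketch using the quotient-remainder hash, one common way to realize the hash-then-combine scheme; the class name, the meta-table sizing, and the element-wise sum combiner are illustrative assumptions rather than the exact configurations benchmarked in the paper:

```python
import torch
import torch.nn as nn

class QRCompositionalEmbedding(nn.Module):
    """Compositional embedding in the spirit of Figure 1(b).

    Feature id i is hashed to (i_1, i_2) = (i // m, i % m), so two
    meta-tables of m rows replace one table of up to m * m rows.
    """

    def __init__(self, num_features: int, dim: int):
        super().__init__()
        self.m = int(num_features ** 0.5) + 1  # meta-table size
        self.quotient = nn.Embedding(self.m, dim)
        self.remainder = nn.Embedding(self.m, dim)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # e_i = combine(e^meta_{i_1}, e^meta_{i_2}); sum is one common choice.
        return self.quotient(ids // self.m) + self.remainder(ids % self.m)

# Toy usage: one million features covered by two ~1001-row meta-tables.
emb = QRCompositionalEmbedding(num_features=1_000_000, dim=16)
vectors = emb(torch.tensor([0, 42, 999_999]))  # shape: (3, 16)
```

The memory saving is the point: two tables of roughly $\sqrt{n}$ rows replace one table of $n$ rows, at the cost of distinct features sharing meta-embeddings.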