On the Efficiency of Sequentially Aware Recommender Systems: Cotten4Rec

Shankar Veludandi; Gulrukh Kurdistan; Uzma Mushtaque

TL;DR

Cotten4Rec introduces a cosine-similarity attention mechanism for sequential recommendation (SR) that reduces memory and compute complexity. Implemented as a single fused CUDA kernel within a BERT4Rec-like encoder, its cost scales linearly with sequence length $s$, at $O(s d^2)$, while it maintains competitive recommendation accuracy. Across three real-world datasets, Cotten4Rec lowers peak GPU memory by about $23\%$ and trains up to $\approx 20\%$ faster on moderate-length sequences, with modest losses in NDCG@10 and HIT@10 on longer sequences. The work demonstrates a practical efficiency-accuracy trade-off for SR tasks with large vocabularies and short-to-medium sequences, though it notes limitations on very long histories and reduced portability due to the custom kernel.
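
The $O(s d^2)$ figure is consistent with the standard linear-attention argument: once queries and keys are L2-normalized and the softmax is dropped, the attention product can be re-associated so that no $s \times s$ matrix is ever formed. A short derivation, assuming a single head with $Q, K, V \in \mathbb{R}^{s \times d}$, sequence length $s$, and hidden size $d$ (the paper's exact scaling constants are not reproduced here):

```latex
\begin{aligned}
\hat{q}_i &= \frac{q_i}{\lVert q_i \rVert_2}, \qquad \hat{k}_j = \frac{k_j}{\lVert k_j \rVert_2}, \qquad Q, K, V \in \mathbb{R}^{s \times d} \\[4pt]
\text{Softmax attention: } &\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \quad QK^{\top} \in \mathbb{R}^{s \times s}
  \;\Rightarrow\; O(s^{2} d)\ \text{time},\ O(s^{2})\ \text{extra memory} \\[4pt]
\text{Cosine attention: } &(\hat{Q}\hat{K}^{\top})V = \hat{Q}\,(\hat{K}^{\top}V), \quad \hat{K}^{\top}V \in \mathbb{R}^{d \times d}
  \;\Rightarrow\; O(s d^{2})\ \text{time},\ O(d^{2})\ \text{extra memory}
\end{aligned}
```

As an illustrative example (not the paper's settings), with $s = 200$ and $d = 64$ the softmax score matrix holds $s^2 = 40{,}000$ entries per head, while $\hat{K}^{\top}V$ holds $d^2 = 4{,}096$. Because BERT4Rec-style encoders attend bidirectionally, one global $\hat{K}^{\top}V$ suffices; a causal variant would instead need running prefix sums.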

Abstract

Sequential recommendation (SR) models predict a user's next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, capture these patterns effectively but incur significant computational overhead from the large intermediate computations required by Softmax-based attention. We propose Cotten4Rec, a novel SR model that uses linear-time cosine similarity attention, implemented as a single optimized Compute Unified Device Architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline LinRec, especially on datasets with moderate sequence lengths and vocabulary sizes. Evaluations across three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal loss in recommendation accuracy, demonstrating its viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are critical.
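
A minimal pure-Python sketch of generic cosine-similarity attention, assuming single-head, bidirectional (unmasked) attention with L2-normalized queries and keys and no softmax or extra scaling. It is not the paper's fused CUDA kernel, and the function names are hypothetical, chosen for illustration. It checks that the linear-time ordering $\hat{Q}(\hat{K}^{\top}V)$ gives the same result as the quadratic ordering $(\hat{Q}\hat{K}^{\top})V$.

```python
import math
import random

Matrix = list[list[float]]


def l2_normalize_rows(m: Matrix, eps: float = 1e-8) -> Matrix:
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense matrix product a @ b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def cosine_attention_linear(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """O(s*d^2): build the d x d summary K_hat^T V once, then apply it to every query."""
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    kv = matmul(transpose(kn), v)  # d x d; size does not depend on s
    return matmul(qn, kv)          # s x d


def cosine_attention_quadratic(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference O(s^2*d): materialize the full s x s cosine-similarity matrix."""
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    scores = matmul(qn, transpose(kn))  # s x s intermediate buffer
    return matmul(scores, v)            # s x d


if __name__ == "__main__":
    random.seed(0)
    s, d = 6, 4  # toy sequence length and hidden size
    q = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    k = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    v = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(s)]
    fast = cosine_attention_linear(q, k, v)
    slow = cosine_attention_quadratic(q, k, v)
    max_err = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
    print(f"max abs difference: {max_err:.2e}")
```

The linear ordering never allocates the $s \times s$ score matrix, which is the kind of intermediate buffer the abstract identifies as the main overhead of Softmax attention. A production kernel would typically also fuse the normalization and both products so that $\hat{Q}$ and $\hat{K}$ are never written back to memory.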

Paper Structure

This paper contains 28 sections, 15 equations, 4 figures, and 3 tables.

Introduction
Related Work
Scaled Dot-Product Attention
Recommender System Models for the SR Problem
Linear Attention Models
Cosine Attention-Based Models
Methodology
Problem Statement
Attention Mechanisms
Model Description
Cosine Similarity and CUDA Kernel for Cotten4Rec
Bidirectional Attention Adaptation
Memory and Compute Complexity
Automatic Mixed Precision (AMP)
Objective Function
...and 13 more sections

Figures (4)

Figure 1: Memory Usage vs Sequence Length
Figure 2: Memory Usage vs Embedding Size
Figure 3: Training Time per Epoch vs Sequence Length
Figure 4: Training Time per Epoch vs Embedding Size
