On the Efficiency of Sequentially Aware Recommender Systems: Cotten4Rec
Shankar Veludandi, Gulrukh Kurdistan, Uzma Mushtaque
TL;DR
Cotten4Rec introduces a cosine-similarity attention mechanism for sequential recommender systems to reduce memory and compute cost. Implemented as a single fused CUDA kernel within a BERT4Rec-like encoder, it scales linearly in sequence length, $O(s d^2)$, while maintaining competitive recommendation accuracy. Across three real-world datasets, Cotten4Rec lowers peak GPU memory by about $23\%$ and trains up to $\approx 20\%$ faster on moderate-length sequences, with modest losses in NDCG@10 and HIT@10 on longer sequences. The work demonstrates a practical efficiency-accuracy trade-off for large-vocabulary, short-to-medium sequence SR tasks, though the authors note limitations on very long histories and reduced portability due to the custom kernel.
Abstract
Sequential recommendation (SR) models predict a user's next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, capture these patterns effectively but incur significant computational overhead from the large intermediate buffers that Softmax-based attention requires. We propose Cotten4Rec, a novel SR model that uses linear-time cosine-similarity attention, implemented as a single optimized Compute Unified Device Architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline, LinRec, especially for datasets with moderate sequence lengths and vocabulary sizes. Evaluations on three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal compromise in recommendation accuracy, demonstrating its viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are constrained.
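To illustrate why dropping Softmax permits linear-time attention, the following is a minimal pure-Python sketch, not the fused CUDA kernel described above. It assumes that $s$ is the sequence length and $d$ the hidden dimension, and that queries and keys are L2-normalized row by row so their dot products are cosine similarities. Because no row-wise Softmax sits between the two matrix products, $(QK^\top)V$ can be reassociated as $Q(K^\top V)$, replacing the $s \times s$ score matrix with a $d \times d$ summary. All function names here are hypothetical, and any scaling, padding masks, or multi-head structure the actual model may use are omitted.

```python
import math
import random


def l2_normalize_rows(mat):
    """Scale each row to unit L2 norm (an all-zero row is left unchanged)."""
    out = []
    for row in mat:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out


def transpose(mat):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def cosine_attention_quadratic(q, k, v):
    """Reference ordering: materialize the s x s cosine-similarity matrix."""
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    scores = matmul(qn, transpose(kn))  # shape s x s
    return matmul(scores, v)            # shape s x d


def cosine_attention_linear(q, k, v):
    """Reassociated ordering: build the d x d summary K^T V first."""
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    kv = matmul(transpose(kn), v)       # shape d x d, size independent of s
    return matmul(qn, kv)               # shape s x d


# Illustrative sizes only (assumption): s = 6 positions, d = 4 features.
random.seed(0)
s, d = 6, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s)]

out_quadratic = cosine_attention_quadratic(q, k, v)
out_linear = cosine_attention_linear(q, k, v)

# Largest elementwise gap between the two orderings (floating-point noise only).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
```

Both orderings produce the same $s \times d$ output up to floating-point rounding, but the second performs $O(s d^2)$ multiply-adds and never stores an $s \times s$ buffer, which is the property the abstract attributes to linear-time cosine-similarity attention.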
