ELASTIC: Efficient Linear Attention for Sequential Interest Compression

Jiaxin Deng, Shiyao Wang, Song Lu, Yinfeng Li, Xinchen Luo, Yuanjun Liu, Peixing Xu, Guorui Zhou

TL;DR

The paper tackles the inefficiency of self-attention for long user behavior sequences in sequential recommendation. It introduces ELASTIC, which combines a Linear Dispatcher Attention (LDA) layer, which compresses long histories into a fixed-length set of interest tokens, with an Interest Memory Retrieval (IMR) layer, which sparsely retrieves from a large, learnable memory of user interests via product-key memory. Key contributions include linear-time attention with $O(Nk)$ complexity (for sequence length $N$ and $k$ interest tokens), a scalable memory-based interest space, and extensive empirical validation on ML-1M and XLong showing competitive or superior accuracy with substantial memory and speed gains (up to $90\%$ memory reduction and $2.7\times$ speedups on long sequences). The approach is aimed at practical modeling of ultra-long sequences for hundreds of millions of users and is supported by ablations and analyses of expert activation patterns. The authors state that they will release their implementation code.
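To make the $O(Nk)$ claim concrete, here is a minimal, dependency-free sketch of the aggregate-then-dispatch idea: $k$ expert queries first attend over the $N$ items to form $k$ interest tokens, and the $N$ items then attend over those $k$ tokens. It is an illustration under stated assumptions, not the authors' implementation: the names `cross_attention`, `linear_dispatcher_attention`, and `experts` are hypothetical, the learned query/key/value projections are omitted, and no causal masking is applied.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys_values):
    """Each query attends over keys_values, used as both keys and values.

    Costs len(queries) * len(keys_values) dot products.
    """
    d = len(keys_values[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        weights = softmax([dot(q, kv) * scale for kv in keys_values])
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(d)])
    return out


def linear_dispatcher_attention(items, experts):
    """Aggregate N items into k tokens, then dispatch them back to N positions."""
    compressed = cross_attention(experts, items)     # k x d, N * k scores
    dispatched = cross_attention(items, compressed)  # N x d, N * k scores
    return compressed, dispatched


random.seed(0)
N, k, d = 200, 8, 16  # assumed toy sizes
items = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
# Hypothetical stand-in for the learnable interest-expert queries.
experts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
compressed, dispatched = linear_dispatcher_attention(items, experts)
print(len(compressed), len(dispatched))  # 8 200
```

Both steps compute $N \cdot k$ attention scores, so the total grows linearly in $N$ for fixed $k$, whereas full self-attention over the same sequence would compute $N^2$ scores.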

Abstract

State-of-the-art sequential recommendation models rely heavily on the transformer's attention mechanism. However, the quadratic computational and memory complexity of self-attention has limited its scalability for modeling users' long-range behaviour sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, which requires only linear time complexity and decouples model capacity from computational cost. Specifically, ELASTIC introduces a fixed-length set of interest experts with a linear dispatcher attention mechanism that compresses long-term behaviour sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7× inference speedup. The proposed linear dispatcher attention mechanism reduces the quadratic complexity to linear and makes it feasible to model extremely long sequences. Moreover, to retain the capacity for modeling diverse user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves the compressed user interests from the memory with negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space while keeping the computational cost the same, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and also highlight its computational efficiency when modeling long sequences. We will make our implementation code publicly available.
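The sparse retrieval from a large memory bank can be illustrated with the general product-key idea named in the TL;DR: the query is split into two halves, each half is scored against a small set of sub-keys, and the best combinations index into an implicit grid of $n \times n$ memory slots. The sketch below is a generic illustration under assumptions, not the paper's IMR layer: the function names, toy sizes, and the softmax weighting of the retrieved values are all hypothetical choices, and the paper's hierarchical query network is not modeled.

```python
import heapq
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def product_key_topk(query, sub_keys_1, sub_keys_2, top_k):
    """Return the top_k (score, slot) pairs over the implicit n * n key grid.

    Slot i * n + j stands for the key concat(sub_keys_1[i], sub_keys_2[j]);
    its score is the sum of the two half-query scores.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    n = len(sub_keys_2)
    best_1 = heapq.nlargest(top_k, [(dot(q1, c), i) for i, c in enumerate(sub_keys_1)])
    best_2 = heapq.nlargest(top_k, [(dot(q2, c), j) for j, c in enumerate(sub_keys_2)])
    # Only top_k * top_k combinations need to be compared, not all n * n.
    candidates = [(a + b, i * n + j) for a, i in best_1 for b, j in best_2]
    return heapq.nlargest(top_k, candidates)


def retrieve(query, sub_keys_1, sub_keys_2, values, top_k):
    """Softmax-weighted sum of the values stored in the selected slots."""
    top = product_key_topk(query, sub_keys_1, sub_keys_2, top_k)
    m = max(score for score, _ in top)
    exps = [math.exp(score - m) for score, _ in top]
    z = sum(exps)
    dim = len(values[0])
    return [sum(e / z * values[slot][j] for e, (_, slot) in zip(exps, top))
            for j in range(dim)]


random.seed(0)
n, half_dim, value_dim, top_k = 32, 4, 8, 4  # assumed toy sizes; n * n = 1024 slots
sub_keys_1 = [[random.gauss(0, 1) for _ in range(half_dim)] for _ in range(n)]
sub_keys_2 = [[random.gauss(0, 1) for _ in range(half_dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(value_dim)] for _ in range(n * n)]
query = [random.gauss(0, 1) for _ in range(2 * half_dim)]
interest = retrieve(query, sub_keys_1, sub_keys_2, values, top_k)
print(len(interest))  # 8
```

In this toy setup the query is scored against $2n = 64$ sub-keys plus $16$ candidate combinations, yet the result matches a search over all $1024$ slots. This is the sense in which the memory can grow much faster than the retrieval cost.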

Paper Structure

This paper contains 22 sections, 10 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Trade-off between NDCG@10 ($y$-axis), inference speed ($x$-axis), and model parameter count (circle radius) on ML-1M.
  • Figure 2: Framework of the proposed ELASTIC. The core of the framework is the LDA layer and the IMR layer. The LDA layer includes two cross-attention mechanisms, one for aggregating and one for dispatching. The IMR layer consists of a hierarchical query network and an interest experts retrieval layer.
  • Figure 3: Model efficiency: ELASTIC vs. SASRec in training GPU memory usage and inference latency on XLong.
  • Figure 4: Visualization of the expert activation pattern of the IMR layer on ML-1M.
  • Figure 5: Hyperparameter sensitivity on ML-1M.