Recurrent Preference Memory for Efficient Long-Sequence Generative Recommendation

Yixiao Chen, Yuan Wang, Yue Liu, Qiyao Wang, Ke Cheng, Xin Xu, Juntong Yan, Shuojin Yang, Menghao Guo, Jun Zhang, Huan Yu, Jie Jiang

TL;DR

This work tackles the scalability of generative recommendations over lifelong user histories by introducing Rec2PM, which compresses long histories into compact Preference Memory tokens. A novel self-referential teacher-forcing training paradigm enables fully parallel optimization of recurrent memory updates, while maintaining the ability to update memory iteratively during inference. The Preference Memory acts as an Information Bottleneck, denoising noisy long histories and retaining salient long-term interests, leading to higher accuracy with dramatically reduced storage and latency compared to full-history attention and KV-cache baselines. Empirical results on large-scale benchmarks and an industrial dataset demonstrate favorable trade-offs, confirming the method’s practical viability for real-world lifelong user modeling.

Abstract

Generative recommendation (GenRec) models typically model user behavior via full attention, but scaling to lifelong sequences is hindered by prohibitive computational costs and noise accumulation from stochastic interactions. To address these challenges, we introduce Rec2PM, a framework that compresses long user interaction histories into compact Preference Memory tokens. Unlike traditional recurrent methods that suffer from serial training, Rec2PM employs a novel self-referential teacher-forcing strategy: it leverages a global view of the history to generate reference memories, which serve as supervision targets for parallelized recurrent updates. This allows for fully parallel training while maintaining the capability for iterative updates during inference. Additionally, by representing memory as token embeddings rather than extensive KV caches, Rec2PM achieves extreme storage efficiency. Experiments on large-scale benchmarks show that Rec2PM significantly reduces inference latency and memory footprint while achieving superior accuracy compared to full-sequence models. Analysis reveals that the Preference Memory functions as a denoising Information Bottleneck, effectively filtering interaction noise to capture robust long-term interests.
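The storage claim can be made concrete with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function names, the sequence lengths, the layer count, the hidden size, and the 2-byte precision are all hypothetical values chosen for illustration, and it assumes standard multi-head attention in which every layer caches one key and one value vector of full hidden width per token.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_dim, bytes_per_value=2):
    """Size of a KV cache: one key and one value vector per token, per layer."""
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value


def memory_token_bytes(num_memory_tokens, hidden_dim, bytes_per_value=2):
    """Size of memory stored as token embeddings: one vector per token, no per-layer copies."""
    return num_memory_tokens * hidden_dim * bytes_per_value


# Hypothetical sizes, chosen only for illustration (not reported in the paper).
history_len, memory_len = 4096, 32
layers, dim = 12, 256

full_kv = kv_cache_bytes(history_len, layers, dim)
compact = memory_token_bytes(memory_len, dim)
ratio = full_kv / compact

print(f"KV cache of full history: {full_kv / 2**20:.1f} MiB")
print(f"Memory token embeddings:  {compact / 2**10:.1f} KiB")
print(f"Reduction factor:         {ratio:.0f}x")
```

Under these assumptions the saving has two independent sources: fewer stored positions (4096 versus 32) and the absence of the per-layer key/value copies (a factor of 2 × 12 here). Architectures with grouped or shared key/value heads would shrink the second factor.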

Paper Structure

This paper contains 52 sections, 12 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: A generative recommendation system constructed based on a Tripartite Memory Mechanism.
  • Figure 2: Taxonomy of Memory Compression Paradigms. Representative methods: (a) ICAE [ge2023context], (b) RMT [bulatov2022recurrent], (c) AutoCompressors [chevalier2023adapting], (d) Gist [mu2023learning], (e) a hypothetical scheme for comparison, (f) PersRec [zhang2026efficient] and Anchor [pang2024anchor]. For (e) and (f), only the training-time attention mask is shown.
  • Figure 3: Illustration of the proposed two-stage parallel training paradigm. Stage 1 generates global reference memories by attending to raw history. Stage 2 performs parallel optimization for incremental updates (in Overwriting or Appending modes) under the supervision of the reference memories.
  • Figure 4: Efficiency-Accuracy Trade-off.
  • Figure 5: Temporal Disentanglement of Preferen. The attention weights across sequence positions reveal specialized temporal roles: Token 0 retains early history (User Identity), Tokens 16/19 focus on recent interactions (Working Memory), and Tokens 3/14 capture diverse periodic patterns (Long-term Habits).
  • ...and 1 more figure
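
The two-stage paradigm described in the Figure 3 caption can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the real model uses attention over memory tokens, whereas here `reference_memory` is a plain mean over the raw prefix, `update_memory` is an exponential blend controlled by a hypothetical parameter `alpha`, and the squared-error loss is a stand-in, since the listing above does not state the training objective. The update resembles the Overwriting mode in that the previous memory is replaced rather than extended. What the sketch does show is why teacher forcing removes the serial dependency: each Stage 2 update reads the reference memory of the previous step, not its own earlier output.

```python
def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def reference_memory(history_prefix):
    """Stage 1 stand-in: summarise the raw history prefix with a global view."""
    return mean_vec(history_prefix)


def update_memory(prev_memory, segment, alpha):
    """Stage 2 / inference stand-in: update using only the old memory and the new segment."""
    seg = mean_vec(segment)
    return [(1 - alpha) * m + alpha * s for m, s in zip(prev_memory, seg)]


def sq_error(a, b):
    """Assumed loss: squared distance between predicted and reference memory."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy history of 2-d item embeddings, split into segments of two interactions.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [2.0, 2.0], [2.0, 4.0]]
seg_len = 2
segments = [history[i:i + seg_len] for i in range(0, len(history), seg_len)]

# Stage 1: one reference memory per segment boundary, computed from the raw prefix.
refs = [reference_memory(history[:(k + 1) * seg_len]) for k in range(len(segments))]

# Stage 2: every update is fed the reference memory of the previous step
# (teacher forcing), so the per-step losses do not depend on each other
# and could be evaluated in parallel.
alpha = 0.5
losses = [
    sq_error(update_memory(refs[k - 1], segments[k], alpha), refs[k])
    for k in range(1, len(segments))
]

# Inference: no global view is available, so the memory is rolled forward
# iteratively from its own previous value.
memory = refs[0]
for seg in segments[1:]:
    memory = update_memory(memory, seg, alpha)

print("per-step training losses:", losses)
print("memory after iterative inference:", memory)
```

In a trainable version, `alpha` would be replaced by model parameters and the per-step losses would be summed and minimised in a single batched pass; the Appending mode would keep earlier memories and add new ones instead of replacing them.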