GEMs: Breaking the Long-Sequence Barrier in Generative Recommendation with a Multi-Stream Decoder

Yu Zhou; Chengcheng Guo; Kuo Cai; Ji Liu; Qiang Luo; Ruiming Tang; Han Li; Kun Gai; Guorui Zhou

TL;DR

GEMs is the first lifelong GR framework successfully deployed in a high-concurrency industrial environment, achieving superior inference efficiency while processing user sequences of over 100,000 interactions.

Abstract

While generative recommendation (GR) models possess strong sequential reasoning capabilities, they face significant challenges when processing extremely long user behavior sequences: the high computational cost forces practical sequence lengths to be limited, preventing models from capturing users' lifelong interests; meanwhile, the inherent "recency bias" of attention mechanisms further weakens learning from long-term history. To overcome this bottleneck, we propose GEMs (Generative rEcommendation with a Multi-stream decoder), a novel and unified framework designed to break the long-sequence barrier by capturing users' lifelong interaction sequences through a multi-stream perspective. Specifically, GEMs partitions user behaviors into three temporal streams (Recent, Mid-term, and Lifecycle) and employs tailored inference schemes for each: a one-stage real-time extractor for immediate dynamics, a lightweight indexer for cross attention to balance accuracy and cost for mid-term sequences, and a two-stage offline-online compression module for lifelong modeling. These streams are integrated via a parameter-free fusion strategy to enable holistic interest representation. Extensive experiments on large-scale industrial datasets demonstrate that GEMs significantly outperforms state-of-the-art methods in recommendation accuracy. Notably, GEMs is the first lifelong GR framework successfully deployed in a high-concurrency industrial environment, achieving superior inference efficiency while processing user sequences of over 100,000 interactions.
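
The three-stream partition described in the abstract can be illustrated with a minimal sketch. It assumes a chronologically ordered history (oldest first) split by position alone; the abstract does not state the actual segmentation rule, and the function name `segment_history` and the window sizes `recent_len` and `mid_len` are hypothetical values chosen only for illustration.

```python
from typing import List, NamedTuple, Sequence


class Streams(NamedTuple):
    lifecycle: List[int]  # oldest behaviors, everything before the mid-term window
    mid_term: List[int]   # behaviors between the lifecycle and recent windows
    recent: List[int]     # newest behaviors


def segment_history(
    history: Sequence[int], recent_len: int = 4, mid_len: int = 8
) -> Streams:
    """Split an oldest-first history into three positional streams.

    recent_len and mid_len are hypothetical window sizes, not values
    taken from the paper.
    """
    if recent_len < 0 or mid_len < 0:
        raise ValueError("window sizes must be non-negative")
    n = len(history)
    # Clamp at zero so short histories fill the recent stream first.
    recent_start = max(n - recent_len, 0)
    mid_start = max(recent_start - mid_len, 0)
    return Streams(
        lifecycle=list(history[:mid_start]),
        mid_term=list(history[mid_start:recent_start]),
        recent=list(history[recent_start:]),
    )


streams = segment_history(list(range(20)), recent_len=4, mid_len=8)
print(len(streams.lifecycle), len(streams.mid_term), len(streams.recent))
```

With this positional rule, a user with fewer interactions than `recent_len` has empty Mid-term and Lifecycle streams, so a downstream model built on such a split would need to handle empty inputs.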

Paper Structure (28 sections, 13 equations, 5 figures, 6 tables)

Introduction
Preliminary
Methods
Overview
Sequence Segmentation
Multi-Stream Framework
Lightweight Indexer for Cross Attention
Parameter-Free Fusion
Training and Inference Strategy
Experiments
Dataset
Metrics
Overall Performance (RQ1)
Baselines.
Results.
...and 13 more sections

Figures (5)

Figure 1: Analyzing the effect of sequence length on training loss and serving resources. GEMs can handle longer user history sequences with fewer inference resources, thereby improving performance on recommendation tasks.
Figure 2: Overall framework of GEMs. (a): The user sequence is partitioned into three segments: Recent, Mid-term, and Lifecycle. (b): Interest extraction from users' lifelong sequences. We derive corresponding interest representations through dedicated encoders. (c): A multi-stream decoder for predicting the user’s next item of interest. All streams are then fused via parameter-free fusion to generate the final prediction. (d): Implementation of the Indexer CA block. This module adopts an indexer-selector design that incorporates a pre-filtering strategy on the full-length key-value sequences, significantly reducing the computational cost of Multi-Head Attention.
Figure 3: Comparison of fusion strategies. (a): All streams share a single encoder and a single decoder. (b): Separate encoders are used for the three streams, and their concatenated outputs are fed into a joint decoder. (c): Separate encoders and cross-attentions, whose outputs are fused with gated weighting. (d): Our proposed parameter-free fusion strategy, where each stream employs independent cross-attentions, whose outputs are fused without learnable parameters.
Figure 4: Analysis of fusion strategies. The model tends to allocate more attention to the recent stream.
Figure 5: Analysis of model size. The model's capacity increases with scaling.
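
The indexer-selector idea in the Figure 2(d) caption, pre-filtering the full-length key-value sequence before attention, can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: it assumes single-head attention, a top-k selector, and separate low-dimensional index vectors, none of which are specified in this excerpt. All names (`select_top_k`, `attend`, `indexed_cross_attention`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_top_k(index_query: Vector, index_keys: Sequence[Vector], k: int) -> List[int]:
    """Cheap pre-filter: score every position with index vectors, keep the k best."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = [dot(index_query, key) for key in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original (chronological) order


def attend(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Single-head scaled dot-product attention over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, key) * scale for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def indexed_cross_attention(
    query: Vector,
    index_query: Vector,
    keys: Sequence[Vector],
    values: Sequence[Vector],
    index_keys: Sequence[Vector],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Attend only over the k positions chosen by the indexer (assumed top-k)."""
    kept = select_top_k(index_query, index_keys, k)
    output = attend(query, [keys[i] for i in kept], [values[i] for i in kept])
    return output, kept
```

In this sketch the full attention step touches only `k` positions, while the scoring pass over all positions uses the smaller index vectors; that is where the saving would come from if the index dimension is much smaller than the attention dimension.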
