LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling

Tianhe Lin, Ziwei Xiong, Baoyuan Ou, Yingjie Qin, Lai Xu, Xiaocheng Zhong, Yao Hu, Zhiyong Wang, Tao Zhou, Yubin Xu, Di Wu

TL;DR

LASER tackles end-to-end modeling of ultra-long user sequences in industrial recommender systems by pairing real-time access to long user histories (SeqVault) with a compress-then-refine attention architecture. Segmented Target Attention (STA) performs noise-resilient, low-cost local compression; Global Stacked Target Attention (GSTA) then refines cross-segment dependencies; and both feed a multi-resolution fusion module for CTR prediction. In production, SeqVault cuts retrieval latency by 50% and CPU usage by 75%, and the full system delivers offline AUC gains of up to +0.24% and online lifts in advertiser value (ADVV, +2.36%) and revenue (+2.08%). Overall, LASER provides a scalable infrastructure for end-to-end long-sequence modeling and a foundation for multi-scenario, lifelong user understanding in large-scale systems.

Abstract

Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
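
The compress-then-refine idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scaled dot-product scoring, the per-item sigmoid gate without normalization, and the single softmax layer standing in for the stacked GSTA module are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def segmented_target_attention(seq, target, segment_len):
    """Compress `seq` into one pooled vector per segment (hypothetical STA).

    Each item gets an independent sigmoid gate computed from its scaled
    dot product with the target. Because gates are not normalized across
    the segment, every item in a noisy segment can be pushed toward zero.
    """
    d = len(target)
    scale = math.sqrt(d)
    compressed = []
    for start in range(0, len(seq), segment_len):
        segment = seq[start:start + segment_len]
        gates = [sigmoid(dot(item, target) / scale) for item in segment]
        pooled = [
            sum(g * item[k] for g, item in zip(gates, segment))
            for k in range(d)
        ]
        compressed.append(pooled)
    return compressed


def global_target_attention(segments, target):
    """One softmax target-attention layer over the compressed segments
    (an assumed, simplified stand-in for GSTA)."""
    d = len(target)
    scores = [dot(s, target) / math.sqrt(d) for s in segments]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * s[k] for w, s in zip(weights, segments))
        for k in range(d)
    ]


# Toy input: 8 behaviors of dimension 2, compressed into 2 segments of 4.
history = [[1.0, 0.0]] * 4 + [[-50.0, 0.0]] * 4
target_item = [1.0, 0.0]
compressed = segmented_target_attention(history, target_item, segment_len=4)
user_vector = global_target_attention(compressed, target_item)
```

In this toy input the second segment contains only items that score strongly negative against the target, so their gates are close to zero and the segment contributes almost nothing. For a history of N items and segment length S, the refinement stage attends over roughly N / S compressed vectors rather than N raw items, which is where the cost reduction in this sketch comes from.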

Paper Structure (40 sections, 15 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of our LASER framework. The architecture comprises four key components: (1) The SeqVault service efficiently retrieves real-time user behavior sequences and side information. (2) The Segmented Target Attention (STA) module performs hierarchical processing, enabling sequence compression while preserving local patterns. (3) The Global Stacked Target Attention (GSTA) module conducts fine-grained interaction modeling on the compressed sequence. (4) The final representation consists of the attention output and the embeddings obtained by the multi-resolution feature fusion module, and is fed into RankMixer for CTR prediction.
  • Figure 2: Comparison of traditional fragmented LastN infrastructures (past) with the unified SeqVault service (now), showcasing resource-efficient, real-time long sequence modeling and a 10× cost reduction through centralized side-information management.
  • Figure 3: Scaling analysis of LASER with varying numbers of layers and sequence lengths. (a) and (b) demonstrate that increasing model depth improves AUC with diminishing returns, while computational cost scales linearly. (c) and (d) show that AUC follows a power-law relationship with increasing sequence length.