LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
Tianhe Lin, Ziwei Xiong, Baoyuan Ou, Yingjie Qin, Lai Xu, Xiaocheng Zhong, Yao Hu, Zhiyong Wang, Tao Zhou, Yubin Xu, Di Wu
TL;DR
LASER tackles end-to-end modeling of ultra-long user sequences in industrial recommender systems by unifying real-time long-history access (SeqVault) with a compress-then-refine attention architecture. Segmented Target Attention (STA) performs noise-resilient, low-cost local compression, and Global Stacked Target Attention (GSTA) then refines cross-segment dependencies; both feed a multi-resolution fusion for CTR prediction. In production, SeqVault cuts retrieval latency by 50% and CPU usage by 75%, and the full system delivers offline AUC gains of up to +0.24% along with online lifts in advertiser value (+2.36%) and revenue (+2.08%). Overall, LASER provides a scalable infrastructure for end-to-end long-sequence modeling and lays a foundation for multi-scenario, lifelong user understanding in large-scale systems.
Abstract
Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments runs into a strict "Latency Wall" imposed by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines the compressed segments to capture cross-segment dependencies without incurring high computational costs. This design compresses the sequence effectively, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in advertiser value (ADVV) and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
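The compress-then-refine idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`segment_compress`, `global_refine`), the fixed-length segmentation, the absence of learned projections, and the choice of one self-attention mixing step followed by softmax target pooling as a stand-in for GSTA are all assumptions made for illustration. The only elements taken from the text are the sigmoid gate, which scores each item independently so that an entire segment can stay "silent", and the second stage that operates on the compressed segment vectors rather than on the raw sequence.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)


def segment_compress(history, target, seg_len):
    """Hypothetical STA-style step: (n, d) history -> (n // seg_len, d) segment vectors."""
    n, d = history.shape
    if n % seg_len != 0:
        raise ValueError("history length must be a multiple of seg_len")
    segments = history.reshape(n // seg_len, seg_len, d)
    # Target-aware relevance score of every item, shape (S, seg_len).
    scores = segments @ target / np.sqrt(d)
    # Sigmoid gates are independent per item (no normalization across the
    # segment), so irrelevant items can all be gated toward zero.
    gates = sigmoid(scores)
    return (gates[..., None] * segments).sum(axis=1)


def global_refine(segment_vecs, target):
    """Assumed stand-in for GSTA: mix segments, then pool them against the target."""
    d = segment_vecs.shape[1]
    # Cross-segment mixing over S vectors only, shape (S, S).
    mix = softmax(segment_vecs @ segment_vecs.T / np.sqrt(d))
    refined = segment_vecs + mix @ segment_vecs
    # Softmax target attention pools the refined segments into one vector.
    weights = softmax(refined @ target / np.sqrt(d))
    return weights @ refined


rng = np.random.default_rng(0)
history = rng.standard_normal((1024, 16))  # 1024 behaviors, 16-dim embeddings
target = rng.standard_normal(16)           # candidate item embedding

compressed = segment_compress(history, target, seg_len=64)
user_vec = global_refine(compressed, target)
print(compressed.shape, user_vec.shape)  # (16, 16) (16,)
```

In this sketch the gating step touches each of the 1024 items once, and the only pairwise computation runs over the 16 segment vectors, which is how a design of this kind avoids attention that is quadratic in the raw sequence length.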
