LREA: Low-Rank Efficient Attention on Modeling Long-Term User Behaviors for CTR Prediction
Xin Song, Xiaochen Li, Jinxin Hu, Hong Wen, Zulong Chen, Yu Zhang, Xiaoyi Zeng, Jing Zhang
TL;DR
The paper tackles the challenge of leveraging long-term user history for CTR prediction without sacrificing online latency. It introduces Low-Rank Efficient Attention (LREA), a long-sequence attention mechanism based on low-rank decomposition and matrix absorption, paired with a non-negativity loss to preserve attention quality. By offline pre-compressing and caching representations, LREA enables fast online inference over very long sequences and demonstrates superior or competitive performance against state-of-the-art baselines on both public and industrial data, including meaningful online CTR and RPM gains with minimal latency overhead. This work enables practical deployment of full long-term user behavior modeling in real-time CTR systems, offering accuracy improvements at scale with controlled computation.
Abstract
With the rapid growth of user historical behavior data, user interest modeling, which learns representations of user intent, has become a prominent part of Click-Through Rate (CTR) prediction. However, modeling ever-longer behavior sequences is computationally expensive, so online services must balance model performance against acceptable response times. Traditional methods often rely on filtering, either keeping only the top-K items selected by item attributes or applying low-precision attention mechanisms, and both can discard significant information. In this study, we introduce LREA, a novel attention mechanism that overcomes these limitations while remaining computationally efficient. LREA uses low-rank matrix decomposition to reduce runtime cost and adds a specially designed loss function that maintains attention capability while preserving information integrity. During inference, matrix absorption and pre-storage strategies are employed to meet runtime constraints. Extensive offline and online experiments demonstrate that our method outperforms state-of-the-art approaches.
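
The abstract names three ingredients (low-rank compression, matrix absorption, and pre-storage) without giving equations, so the following is not the paper's formulation. It is a minimal generic sketch, under stated assumptions, of how such ingredients could fit together: a hypothetical compression matrix `P` reduces a length-`L` behavior sequence to `r` rows offline, the result `C` is cached per user, and the query and key projections are absorbed into a single matrix `W_qk` so that online attention touches only `r` cached rows. All names, shapes, and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, r = 10_000, 32, 16  # hypothetical: sequence length, embedding dim, rank

E = rng.standard_normal((L, d))               # long behavior-sequence embeddings
P = rng.standard_normal((r, L)) / np.sqrt(L)  # assumed learned compression (random here)
W_q = rng.standard_normal((d, d)) / np.sqrt(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

# Offline "pre-storage": compress the sequence once and cache C per user.
C = P @ E                                     # shape (r, d), independent of the target item

# "Matrix absorption": fold the query and key projections into one matrix,
# since (t W_q)(C W_k)^T == t (W_q W_k^T) C^T.
W_qk = W_q @ W_k.T                            # shape (d, d)

def online_attention(target, C, W_qk, W_v):
    """Attention of one target-item embedding over the r cached rows."""
    scores = (target @ W_qk) @ C.T / np.sqrt(C.shape[1])
    weights = softmax(scores)
    # Project after pooling: weights @ (C W_v) == (weights @ C) W_v.
    return (weights @ C) @ W_v

def reference_attention(target, C, W_q, W_k, W_v):
    """Same computation with explicit query, key, and value projections."""
    q, K, V = target @ W_q, C @ W_k, C @ W_v
    weights = softmax(q @ K.T / np.sqrt(C.shape[1]))
    return weights @ V

target = rng.standard_normal(d)               # candidate-item embedding
out = online_attention(target, C, W_qk, W_v)
ref = reference_attention(target, C, W_q, W_k, W_v)
```

In this sketch the online step costs on the order of `r * d` operations per candidate rather than `L * d`, which is the kind of saving the abstract attributes to pre-storage. The absorbed and explicit forms agree up to floating-point error, which can be checked with `np.allclose(out, ref)`.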
