LREA: Low-Rank Efficient Attention on Modeling Long-Term User Behaviors for CTR Prediction

Xin Song, Xiaochen Li, Jinxin Hu, Hong Wen, Zulong Chen, Yu Zhang, Xiaoyi Zeng, Jing Zhang

TL;DR

The paper tackles the challenge of leveraging long-term user history for CTR prediction without sacrificing online latency. It introduces Low-Rank Efficient Attention (LREA), a long-sequence attention mechanism based on low-rank decomposition and matrix absorption, paired with a non-negativity loss to preserve attention quality. By offline pre-compressing and caching representations, LREA enables fast online inference over very long sequences and demonstrates superior or competitive performance against state-of-the-art baselines on both public and industrial data, including meaningful online CTR and RPM gains with minimal latency overhead. This work enables practical deployment of full long-term user behavior modeling in real-time CTR systems, offering accuracy improvements at scale with controlled computation.

Abstract

With the rapid growth of user historical behavior data, user interest modeling, which learns representations of user intent, has become a prominent part of Click-Through Rate (CTR) prediction. However, modeling ever-longer behavior sequences poses computational challenges, requiring a balance between model performance and acceptable response times for online services. Traditional methods often rely on filtering techniques, which can lose significant information because they either keep only the top-K items based on item attributes or employ low-precision attention mechanisms. In this study, we introduce LREA, a novel attention mechanism that overcomes the limitations of existing approaches while ensuring computational efficiency. LREA leverages low-rank matrix decomposition to optimize runtime performance and incorporates a specially designed loss function to maintain attention capabilities while preserving information integrity. During the inference phase, matrix absorption and pre-storage strategies are employed to meet runtime constraints. Extensive offline and online experiments demonstrate that our method outperforms state-of-the-art approaches.

Paper Structure

This paper contains 11 sections, 11 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The structure of LREA. In the offline training stage, LREA updates the matrices $W_{Comp}$ and $W_{Decomp}$ without computational optimization, then caches them together with $E_{s}$ and $E_{s}^T$ in HBM. During online inference, LREA uses the compressed sequences $E_{Auxabsorb}$ and $E_{Comp}$, which shorten the user behavior sequence and improve computational efficiency.
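
The caching idea in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the paper's formulation: it assumes a softmax-free (linear) attention, and it assumes that $E_{Auxabsorb}$ corresponds to $E_{s}^T W_{Comp}$ and $E_{Comp}$ to $W_{Decomp} E_{s}$, which this summary does not state. All function names, shapes, and the rank `r` are hypothetical. The point is only that, by associativity, the projection matrices can be folded into the behavior sequence ahead of time, so the per-request cost depends on the rank `r` rather than on the sequence length `L`.

```python
import numpy as np


def precompute_cache(E_s, W_comp, W_decomp):
    """Offline step (assumed): absorb the projections into the sequence.

    E_s:      (L, d) user behavior embeddings
    W_comp:   (L, r) compression matrix (hypothetical shape)
    W_decomp: (r, L) decompression matrix (hypothetical shape)
    """
    E_aux = E_s.T @ W_comp    # (d, r), plays the role of E_Auxabsorb here
    E_comp = W_decomp @ E_s   # (r, d), plays the role of E_Comp here
    return E_aux, E_comp


def online_interest(q, E_aux, E_comp):
    """Online step: O(d * r) work, independent of sequence length L."""
    return (q @ E_aux) @ E_comp   # (d,)


def reference_interest(q, E_s, W_comp, W_decomp):
    """Same quantity computed the slow way, touching all L behaviors."""
    scores = q @ E_s.T                             # (L,) raw attention scores
    low_rank_scores = (scores @ W_comp) @ W_decomp  # (L,) rank-r projection
    return low_rank_scores @ E_s                   # (d,)


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
L, d, r = 2000, 16, 8
E_s = rng.normal(size=(L, d))
W_comp = rng.normal(size=(L, r)) / np.sqrt(L)
W_decomp = rng.normal(size=(r, L)) / np.sqrt(L)
q = rng.normal(size=d)  # target-item embedding

E_aux, E_comp = precompute_cache(E_s, W_comp, W_decomp)
fast = online_interest(q, E_aux, E_comp)
slow = reference_interest(q, E_s, W_comp, W_decomp)
```

Here `fast` and `slow` agree up to floating-point error, while the online path reads only the two cached matrices of sizes `(d, r)` and `(r, d)`. The sketch leaves out the non-negativity loss mentioned in the TL;DR, as well as any softmax normalization.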