EST: Towards Efficient Scaling Laws in Click-Through Rate Prediction via Unified Modeling

Mingyang Liu, Yong Bai, Zhangming Chan, Sishuo Chen, Xiang-Rong Sheng, Han Zhu, Jian Xu, Xinyang Chen

TL;DR

This work tackles efficient scaling for industrial CTR prediction under strict latency constraints by revisiting the differences between CTR prediction and LLMs and introducing EST, a fully unified transformer. EST uses Lightweight Cross-Attention to prune redundant self-interactions and Content Sparse Attention to exploit content similarity, which makes long behavioral sequences affordable. The approach yields a stable power-law scaling relationship with model size and compute, outperforms state-of-the-art baselines offline, and delivers CTR and RPM gains in Taobao's online deployment. Together, the analysis and the production deployment indicate a viable pathway toward scalable, high-performance industrial CTR models.

Abstract

Efficiently scaling industrial Click-Through Rate (CTR) prediction has recently attracted significant research attention. Existing approaches typically employ early aggregation of user behaviors to maintain efficiency. However, such non-unified or partially unified modeling creates an information bottleneck by discarding fine-grained, token-level signals essential for unlocking scaling gains. In this work, we revisit the fundamental distinctions between CTR prediction and Large Language Models (LLMs), identifying two critical properties: the asymmetry in information density between behavioral and non-behavioral features, and the modality-specific priors of content-rich signals. Accordingly, we propose the Efficiently Scalable Transformer (EST), which achieves fully unified modeling by processing all raw inputs in a single sequence without lossy aggregation. EST integrates two modules: Lightweight Cross-Attention (LCA), which prunes redundant self-interactions to focus on high-impact cross-feature dependencies, and Content Sparse Attention (CSA), which utilizes content similarity to dynamically select high-signal behaviors. Extensive experiments show that EST exhibits a stable and efficient power-law scaling relationship, enabling predictable performance gains with model scale. Deployed on Taobao's display advertising platform, EST significantly outperforms production baselines, delivering a 3.27% RPM (Revenue Per Mille) increase and a 1.22% CTR lift, establishing a practical pathway for scalable industrial CTR prediction models.
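
To make the idea behind Content Sparse Attention more concrete, the following is a minimal sketch of attention restricted to the most content-similar positions of a behavior sequence. It is not the authors' implementation: the abstract does not specify how similarity is computed or how many behaviors are kept, so the cosine similarity, the `top_k` parameter, and the function name `content_topk_attention` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def content_topk_attention(content, q, k, v, top_k):
    """Hypothetical sketch: each position attends only to the top_k positions
    whose content embedding is most cosine-similar to its own.

    content, q, k, v are lists of per-position vectors of equal length.
    Returns (outputs, selected), where selected[i] lists the kept positions.
    """
    n = len(content)
    d = len(q[0])
    outputs, selected = [], []
    for i in range(n):
        # Assumption: similarity is measured on content embeddings only.
        sims = [cosine(content[i], content[j]) for j in range(n)]
        keep = sorted(range(n), key=lambda j: sims[j], reverse=True)[:top_k]
        # Scaled dot-product attention restricted to the kept positions.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keep
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j][c] for w, j in zip(weights, keep)) for c in range(len(v[0]))]
        )
        selected.append(sorted(keep))
    return outputs, selected


# Toy behavior sequence with two content clusters (made-up values).
content = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
outputs, selected = content_topk_attention(content, content, content, values, top_k=2)
print(selected)
```

With `top_k` fixed, the attention step for each position costs a constant number of score computations instead of one per sequence element. This sketch still computes the full pairwise similarity matrix, so a practical version would need a cheaper selection step.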

Paper Structure (26 sections, 13 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: CTR prediction in recommendation systems.
  • Figure 2: The architecture of EST. Feature-specific tokenizers map raw features to corresponding tokens. Lightweight Cross-Attention (LCA) captures the interactions between non-behavioral tokens and behavioral tokens. Content Sparse Attention (CSA) calculates intra-sequence content similarity and performs sparse attention within the behavior sequences.
  • Figure 3: Visualization and effective rank of the attention matrix. For visual clarity, only the attention distributions of candidate-specific behavioral tokens and non-behavioral tokens are illustrated, as user-specific behaviors exhibit patterns similar to candidate-specific behaviors.
  • Figure 4: The power-law relationship between computation overhead $C$ and GAUC. When scaling model depth, the best-fit curve is $\Delta\text{GAUC} = 0.61 \times C^{0.12}$. When scaling model width, the best-fit curve is $\Delta\text{GAUC} = 0.68 \times C^{0.10}$.
  • Figure 5: The power-law relationship between model capacity $P$ and GAUC. When scaling model depth, the best-fit curve is $\Delta\text{GAUC} = 0.46 \times P^{0.14}$. When scaling model width, the best-fit curve is $\Delta\text{GAUC} = 0.63 \times P^{0.08}$.
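
The best-fit curves quoted in the captions of Figures 4 and 5 all have the form $\Delta\text{GAUC} = a \times x^{b}$, which is a straight line in log-log space. The sketch below shows one common way such a curve could be fitted and read, using ordinary least squares on the logarithms. The data points are synthetic, generated from the depth-scaling curve of Figure 4 rather than taken from the paper, and the units of $C$ and $P$ are not stated in the captions, so the inputs here are unitless placeholders. The helper names are hypothetical.

```python
import math


def power_law(x, a, b):
    """Evaluate y = a * x**b."""
    return a * x ** b


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Taking logs gives log y = log a + b * log x, a linear model whose
    slope is b and whose intercept is log a.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (w - mean_y) for u, w in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def doubling_gain(b):
    """Factor by which y grows when x doubles under y = a * x**b."""
    return 2.0 ** b


# Synthetic, noise-free points from the Figure 4 depth curve (a=0.61, b=0.12).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [power_law(x, 0.61, 0.12) for x in xs]

a_hat, b_hat = fit_power_law(xs, ys)
print(f"fitted a={a_hat:.4f}, b={b_hat:.4f}")
print(f"doubling x multiplies delta-GAUC by {doubling_gain(b_hat):.4f}")
```

Because the exponent alone sets the relative gain from doubling the input, comparing exponents (for example 0.12 for depth against 0.10 for width in Figure 4) is a quick way to compare how the two scaling directions behave.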