From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction

Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Di Wang, Kaiyi Lin, Pengjie Wang, Jian Xu, Bo Zheng

TL;DR

The paper addresses the mismatch between standard Transformer assumptions and CTR data, arguing that CTR requires structured, field-aware interactions rather than pure sequential modeling. It introduces the Field-Aware Transformer (FAT), which uses field-aware content alignment and field-pair interaction modulation, plus a basis-composed hypernetwork to keep parameter growth tied to the number of semantic fields $F$ rather than the vocabulary size $n$. The authors prove a principled generalization bound via Rademacher complexity, predicting a power-law scaling of performance with model width, and validate FAT on a large-scale Taobao dataset, achieving up to +0.51% AUC improvements and significant online gains (e.g., +2.33% CTR, +0.66% RPM). They further show FAT’s interpretability through structured, asymmetric cross-field interaction patterns and demonstrate scalable, production-ready parameter generation with zero serving overhead. Overall, the work demonstrates that scalable, predictable CTR performance comes from architecting models that align with data semantics rather than indiscriminately increasing size.

Abstract

Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields $F$, not the total vocabulary size $n \gg F$, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity: architectural coherence with data semantics.
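
The field-aware attention described above can be sketched in a few lines of NumPy. This is only an illustrative reading of the abstract, not the authors' implementation: the function name `field_aware_attention`, the per-field matrix shapes, the single-head layout, and the choice to apply the field-pair scalar $w_{f_i,f_j}$ multiplicatively to the attention score are all assumptions made for the example.

```python
import numpy as np

def field_aware_attention(H, fields, Wq, Wk, Wv, w_pair):
    """Single-head attention with field-dependent projections (illustrative).

    H:        (N, d) token embeddings
    fields:   (N,)   integer field id of each token, in [0, F)
    Wq/Wk/Wv: (F, d, d) one projection matrix per field (assumed shape)
    w_pair:   (F, F) scalar per (query field, key field) pair
    """
    N, d = H.shape
    # Project every token with the matrices that belong to its own field.
    Q = np.einsum("nd,nde->ne", H, Wq[fields])
    K = np.einsum("nd,nde->ne", H, Wk[fields])
    V = np.einsum("nd,nde->ne", H, Wv[fields])
    # Content alignment: scaled dot product between queries and keys.
    scores = (Q @ K.T) / np.sqrt(d)
    # Field-pair modulation: scale score (i, j) by w[f_i, f_j].
    # Multiplicative modulation is an assumption of this sketch.
    scores = scores * w_pair[fields[:, None], fields[None, :]]
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
N, d, F = 5, 8, 3                      # toy sizes, chosen arbitrarily
H = rng.normal(size=(N, d))
fields = np.array([0, 1, 1, 2, 0])     # field id of each of the N tokens
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(F, d, d)) for _ in range(3))
w_pair = rng.uniform(0.5, 1.5, size=(F, F))  # need not be symmetric

out, A = field_aware_attention(H, fields, Wq, Wk, Wv, w_pair)

# Attention parameters in this sketch depend on F and d only,
# never on the vocabulary size n.
n_attention_params = 3 * F * d * d + F * F
print(out.shape, n_attention_params)
```

In this sketch the attention parameter count is $3Fd^2 + F^2$, so it grows with the number of fields and the width but not with the vocabulary, which is the property the abstract emphasizes.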

Paper Structure

This paper contains 27 sections, 2 theorems, 19 equations, 3 figures, 4 tables.

Key Result

Theorem 1

Let $\mathcal{D}$ be a distribution over input sequences $\mathbf{H} = [\mathbf{h}_1, \dots, \mathbf{h}_N]$ with $\|\mathbf{h}_i\|_2 \leq R$. Assume all parameter matrices ($\mathbf{W}^{(f)}_Q, \mathbf{W}^{(f)}_K, \mathbf{W}^{(f)}_V$) have Frobenius norm bounded by $B$, and all interaction scalars ($w_{f_i,f_j}$) … where $C(R, B, B_w, d) = \mathcal{O}(R^2 B^2 B_w \sqrt{d} + R B B_w)$ is a constant depending on th…

Figures (3)

  • Figure 1: The architecture of FAT.
  • Figure 2: Heatmap of $w_{f_i,f_j}$. Strong interactions (dark) align with expected semantic dependencies (e.g., item and real-time interest).
  • Figure 3: Power-law relationship between parameter count and AUC. Best-fit curve: $\Delta \mathrm{AUC} = 5.81 \times 10^{-5} \cdot N_{\mathrm{params}}^{0.433}$.
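
To make the fitted power law in the Figure 3 caption concrete, the short script below evaluates how the predicted AUC gain changes when the parameter count is scaled. The coefficient and exponent are taken from the caption; the function names are hypothetical, and because the caption does not state the units of $N_{params}$ or the baseline against which $\Delta AUC$ is measured, only ratios between predictions are interpreted here.

```python
# Constants quoted in the Figure 3 caption.
COEFF = 5.81e-5
EXPONENT = 0.433

def delta_auc(n_params: float) -> float:
    """Predicted AUC gain for a given parameter count under the fitted law."""
    return COEFF * n_params ** EXPONENT

def gain_ratio(scale: float) -> float:
    """Factor by which the predicted gain grows when parameters grow by `scale`.

    For a pure power law the coefficient cancels, leaving scale ** EXPONENT.
    """
    return scale ** EXPONENT

doubling = gain_ratio(2.0)
tenfold = gain_ratio(10.0)
print(f"2x parameters  -> {doubling:.3f}x predicted gain")
print(f"10x parameters -> {tenfold:.3f}x predicted gain")
```

Because the exponent is below 1, returns are sublinear but still predictable: under this fit, doubling the parameter count multiplies the predicted gain by roughly 1.35 rather than 2.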

Theorems & Definitions (2)

  • Theorem 1: Generalization Bound for FAT
  • Theorem 2: Generalization Bound for FAT