From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction
Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Di Wang, Kaiyi Lin, Pengjie Wang, Jian Xu, Bo Zheng
TL;DR
The paper addresses the mismatch between standard Transformer assumptions and CTR data, arguing that CTR prediction requires structured, field-aware interactions rather than purely sequential modeling. It introduces the Field-Aware Transformer (FAT), which combines field-aware content alignment with field-pair interaction modulation, and uses a basis-composed hypernetwork so that parameter growth is tied to the number of semantic fields $F$ rather than the vocabulary size $n$. The authors derive a generalization bound via Rademacher complexity that predicts power-law scaling of performance with model width, and they validate FAT on a large-scale Taobao dataset, where it improves AUC by up to +0.51% and yields online gains of +2.33% CTR and +0.66% RPM. They further show that FAT is interpretable, exhibiting structured, asymmetric cross-field interaction patterns, and that its parameter generation is production-ready with zero serving overhead. Overall, the work argues that scalable, predictable CTR performance comes from architectures that align with data semantics rather than from indiscriminate increases in model size.
Abstract
Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns, a stark contrast to the smooth, predictable gains seen in large language models. We identify the root cause as a structural misalignment: Transformers assume sequential compositionality, while CTR data demand combinatorial reasoning over high-cardinality semantic fields. Unstructured attention spreads capacity indiscriminately, amplifying noise under extreme sparsity and breaking scalable learning. To restore alignment, we introduce the Field-Aware Transformer (FAT), which embeds field-based interaction priors into attention through decomposed content alignment and cross-field modulation. This design ensures model complexity scales with the number of fields $F$, not the total vocabulary size $n \gg F$, leading to tighter generalization and, critically, observed power-law scaling in AUC as model width increases. We present the first formal scaling law for CTR models, grounded in Rademacher complexity, that explains and predicts this behavior. On large-scale benchmarks, FAT improves AUC by up to +0.51% over state-of-the-art methods. Deployed online, it delivers +2.33% CTR and +0.66% RPM. Our work establishes that effective scaling in recommendation arises not from size, but from structured expressivity: architectural coherence with data semantics.
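To make the idea of field-aware attention concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one plausible reading of the abstract: each token's query and key projections are chosen by its field and composed from a small set of shared basis matrices, and each attention score is scaled by a learned scalar for the (query field, key field) pair. The function name, the argument names (`bases_q`, `coef_q`, `pair_gate`), the multiplicative gate, and all sizes are illustrative assumptions that do not appear in the paper.

```python
import numpy as np


def field_aware_attention(x, field_ids, bases_q, bases_k, coef_q, coef_k, pair_gate):
    """Hypothetical single-head field-aware attention (illustrative only).

    x:         (T, d) token embeddings, one token per feature value
    field_ids: (T,)   integer field index of each token, in [0, F)
    bases_*:   (B, d, d) shared basis matrices
    coef_*:    (F, B) per-field mixing coefficients over the bases
    pair_gate: (F, F) scalar modulation per (query field, key field) pair
    """
    d = x.shape[1]
    # Compose one projection matrix per field from the shared bases:
    # (F, B) x (B, d, d) -> (F, d, d).
    w_q = np.einsum("fb,bij->fij", coef_q, bases_q)
    w_k = np.einsum("fb,bij->fij", coef_k, bases_k)
    # Project every token with the matrix belonging to its own field.
    q = np.einsum("ti,tij->tj", x, w_q[field_ids])
    k = np.einsum("ti,tij->tj", x, w_k[field_ids])
    scores = (q @ k.T) / np.sqrt(d)
    # Field-pair modulation: the gate is indexed by (query field, key field)
    # and need not be symmetric, so interactions can be directional.
    scores = scores * pair_gate[field_ids[:, None], field_ids[None, :]]
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Toy sizes (assumed): F fields, B bases, width d, T tokens.
rng = np.random.default_rng(0)
F, B, d, T = 4, 2, 8, 6
field_ids = np.array([0, 1, 1, 2, 3, 3])
x = rng.normal(size=(T, d))
bases_q = rng.normal(size=(B, d, d))
bases_k = rng.normal(size=(B, d, d))
coef_q = rng.normal(size=(F, B))
coef_k = rng.normal(size=(F, B))
pair_gate = rng.normal(size=(F, F))

out, weights = field_aware_attention(
    x, field_ids, bases_q, bases_k, coef_q, coef_k, pair_gate
)

# Attention parameters depend on F, B, and d only, never on the vocabulary size n.
n_params = bases_q.size + bases_k.size + coef_q.size + coef_k.size + pair_gate.size
print(out.shape, n_params)
```

In this sketch the attention parameter count is $2Bd^2 + 2FB + F^2$, so adding more feature values to a field leaves it unchanged; only the embedding table, which is outside the sketch, grows with $n$.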
