
CADET: Context-Conditioned Ads CTR Prediction With a Decoder-Only Transformer

David Pardoe, Neil Daftary, Miro Furtado, Aditya Aiyer, Yu Wang, Liuqing Li, Tao Song, Lars Hertel, Young Jin Yun, Senthil Radhakrishnan, Zhiwei Wang, Tommy Li, Khai Tran, Ananth Nagarajan, Ali Naqvi, Yue Zhang, Renpeng Fang, Avi Romascanu, Arjun Kulothungun, Deepak Kumar, Praneeth Boda, Fedor Borisyuk, Ruoyan Wang

TL;DR

CADET addresses the challenge of predicting ads CTR with a decoder-only Transformer by introducing a context-conditioned decoding mechanism that accounts for post-scoring signals like ad position. It combines self-gated attention, timestamp-based RoPE, and session-aware masking to stabilize training and ensure offline-online consistency, while production techniques such as packing, chunking, and a custom Flash Attention kernel enable scalable training and low-latency serving. The model uses context-conditioned prediction heads and auxiliary tasks, optimized with a RankNet-style pairwise loss, and achieves an online CTR lift of $+11.04\%$ over the production LiRank baseline, validating its effectiveness in industrial settings. Overall, CADET demonstrates that a unified generative decoder can outperform a multi-component DLRM ensemble for large-scale ads ranking, with practical benefits in efficiency and deployment at LinkedIn.
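
As an illustration of the RankNet-style pairwise loss mentioned above, the following is a minimal sketch in plain Python. It shows the generic RankNet formulation, in which every (clicked, non-clicked) pair contributes `log(1 + exp(-(s_pos - s_neg)))`. The function names, the way pairs are formed, and the mean reduction are assumptions made for illustration; the exact pairing scheme and weighting used in CADET are not described in this summary.

```python
import math

def ranknet_pair_loss(s_pos, s_neg):
    """Loss for one pair where the item scored s_pos should outrank s_neg.

    Computes log(1 + exp(-(s_pos - s_neg))) in a numerically stable way.
    """
    diff = s_pos - s_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def ranknet_loss(scores, labels):
    """Mean pairwise loss over all pairs whose labels differ (assumed reduction)."""
    pairs = [
        (s_i, s_j)
        for s_i, y_i in zip(scores, labels)
        for s_j, y_j in zip(scores, labels)
        if y_i > y_j  # item i was clicked, item j was not
    ]
    if not pairs:
        return 0.0  # no clicked/non-clicked pair, so nothing to rank
    return sum(ranknet_pair_loss(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores for three impressions; only the first was clicked.
well_ordered = ranknet_loss([2.0, 0.5, -1.0], [1, 0, 0])
mis_ordered = ranknet_loss([-1.0, 0.5, 2.0], [1, 0, 0])
print(well_ordered < mis_ordered)
```

The loss depends only on score differences, so it optimizes relative ordering rather than calibrated probabilities.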

Abstract

Click-through rate (CTR) prediction is fundamental to online advertising systems. While Deep Learning Recommendation Models (DLRMs) with explicit feature interactions have long dominated this domain, recent advances in generative recommenders have shown promising results in content recommendation. However, adapting these transformer-based architectures to ads CTR prediction still presents unique challenges, including handling post-scoring contextual signals, maintaining offline-online consistency, and scaling to industrial workloads. We present CADET (Context-Conditioned Ads Decoder-Only Transformer), an end-to-end decoder-only transformer for ads CTR prediction deployed at LinkedIn. Our approach introduces several key innovations: (1) a context-conditioned decoding architecture with multi-tower prediction heads that explicitly model post-scoring signals such as ad position, resolving the chicken-and-egg problem between predicted CTR and ranking; (2) a self-gated attention mechanism that stabilizes training by adaptively regulating information flow at both representation and interaction levels; (3) a timestamp-based variant of Rotary Position Embedding (RoPE) that captures temporal relationships across timescales from seconds to months; (4) session masking strategies that prevent the model from learning dependencies on unavailable in-session events, addressing train-serve skew; and (5) production engineering techniques including tensor packing, sequence chunking, and custom Flash Attention kernels that enable efficient training and serving at scale. In online A/B testing, CADET achieves an 11.04\% CTR lift compared to the production LiRank baseline model, a hybrid ensemble of DCNv2 and sequential encoders. The system has been successfully deployed on LinkedIn's advertising platform, serving the main traffic for homefeed sponsored updates.
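
To make item (3) more concrete, here is a minimal sketch of a timestamp-based rotary embedding in plain Python. Instead of the integer token index used by standard RoPE, each feature pair is rotated by an angle proportional to the event timestamp. The geometric spacing of wavelengths and the range of 1 second to roughly 90 days are assumptions chosen to match the "seconds to months" description; the names `timestamp_rope` and `attention_score` are hypothetical, and the paper's actual parameterization may differ.

```python
import math

def timestamp_rope(vec, t, min_period=1.0, max_period=90 * 24 * 3600.0):
    """Rotate consecutive feature pairs of `vec` by angles driven by time `t` (seconds).

    Wavelengths are spaced geometrically between min_period and max_period
    (assumed values), so some pairs resolve seconds and others resolve months.
    """
    assert len(vec) % 2 == 0, "feature dimension must be even"
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        frac = i / (n_pairs - 1) if n_pairs > 1 else 0.0
        period = min_period * (max_period / min_period) ** frac
        angle = 2.0 * math.pi * t / period
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out

def attention_score(q, t_q, k, t_k):
    """Dot product of a rotated query and key (unscaled attention logit)."""
    rq = timestamp_rope(q, t_q)
    rk = timestamp_rope(k, t_k)
    return sum(x * y for x, y in zip(rq, rk))

# Hypothetical query/key vectors, one hour apart.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
score = attention_score(q, 3600.0, k, 0.0)
shifted = attention_score(q, 4600.0, k, 1000.0)  # both events 1000 s later
print(abs(score - shifted) < 1e-6)
```

As with standard RoPE, the resulting attention logit depends only on the time difference between the two events, not on their absolute timestamps. In practice, timestamps would likely be expressed relative to a reference time to keep the angles numerically well behaved.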

Paper Structure (49 sections, 14 equations, 5 figures, 3 tables)


Figures (5)

  • Figure 1: Architecture of the proposed model. (a) Overall architecture showing the interleaved impression-action sequence processing through the decoder-only transformer. (b) Detailed view of the context-conditioned decoding block with multiple prediction heads.
  • Figure 2: Detailed view of the self-gated attention module.
  • Figure 3: Session-aware masking strategies (depicted before item/action interleaving). Solid yellow cells are always unmasked. Striped cells fall within the $\Delta_{\text{delay}}$ threshold; they are masked during training but available at inference. Gray cells are masked, and dark red cells denote the preserved diagonal.
  • Figure 4: Transition from padded to packed representation. Each color denotes a distinct user sequence; packing eliminates the gray padding and stores only real tokens.
  • Figure 5: Training AUC curves over the first 20K steps under different self-gating configurations. The curves shown are representative samples from multiple runs that exhibited consistent dynamics.
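
The padded-to-packed transition described in the Figure 4 caption can be sketched as follows. This is a minimal plain-Python illustration, assuming the packed batch is stored as one flat token list plus cumulative sequence offsets, a layout commonly consumed by variable-length attention kernels. The helper names and the `PAD` value are hypothetical, and the paper's exact memory layout is not given in this summary.

```python
def pack_sequences(padded, lengths):
    """Flatten a padded batch into one token list plus cumulative offsets.

    padded:  list of equal-length rows (real tokens followed by padding).
    lengths: number of real tokens in each row.
    """
    tokens = []
    cu_seqlens = [0]  # cu_seqlens[i] is where user sequence i starts
    for row, n in zip(padded, lengths):
        tokens.extend(row[:n])  # keep only real tokens, drop padding
        cu_seqlens.append(cu_seqlens[-1] + n)
    return tokens, cu_seqlens

def unpack_sequences(tokens, cu_seqlens):
    """Recover the per-user sequences from the packed representation."""
    return [tokens[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]

PAD = 0  # hypothetical padding id
padded = [
    [11, 12, 13, PAD],
    [21, PAD, PAD, PAD],
    [31, 32, 33, 34],
]
lengths = [3, 1, 4]

tokens, cu_seqlens = pack_sequences(padded, lengths)
print(tokens)       # [11, 12, 13, 21, 31, 32, 33, 34]
print(cu_seqlens)   # [0, 3, 4, 8]
```

Here the padded batch holds 12 slots while the packed one holds 8 tokens. An attention kernel operating on the packed layout must use the offsets to keep tokens from different user sequences from attending to one another.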