STEC: See-Through Transformer-based Encoder for CTR Prediction

Serdarcan Dilbaz, Hasan Saribas

TL;DR

STEC presents a transformer-inspired encoder for CTR prediction that unifies multiple interaction learning strategies by deriving bilinear interactions from a modified attention mechanism. By stacking $N$ STEC blocks and fusing $N+1$ bilinear interactions across levels, it achieves state-of-the-art or competitive results on four real-world datasets and in production, at a lower computational cost than many attention-based baselines. The architecture comprises feature embeddings, a dedicated STEC block that exposes bilinear interactions, FFNs between blocks, and a concatenation-based fusion of multi-level interactions that feeds a final MLP. The attention weights make the model explainable by offering interpretable rationales for recommendations, and ablation studies validate the importance of the intermediate interactions and of the fusion strategy. Overall, STEC delivers a scalable, explainable CTR predictor with strong empirical performance and practical deployment potential.
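The data flow described above (stacked blocks, one exposed interaction per level, concatenation, final prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `interaction_and_update` and `stec_like_forward` are hypothetical, the elementwise-product form of the interaction is an assumption, the extra attention head used to obtain the $(N+1)$-th interaction is an assumption, the FFNs are omitted, and a single linear layer stands in for the final MLP.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear(x, w):
    """Apply a (d_in x d_out) weight matrix to every row of x."""
    cols = list(zip(*w))
    return [[dot(row, col) for col in cols] for row in x]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def interaction_and_update(x, params):
    """Hypothetical block: returns (next-level fields, exposed interaction, attention)."""
    w_q, w_k, w_v = params
    q, k, v = linear(x, w_q), linear(x, w_k), linear(x, w_v)
    scale = math.sqrt(len(q[0]))
    # Scaled dot-product attention over feature fields; each row sums to 1.
    attn = [softmax([dot(qi, kj) / scale for kj in k]) for qi in q]
    # Attention-weighted mix of value vectors for each field.
    mixed = [[dot(a_row, col) for col in zip(*v)] for a_row in attn]
    # ASSUMED interaction form: field embedding times its attended context.
    inter = [[a * b for a, b in zip(xr, mr)] for xr, mr in zip(x, mixed)]
    # Residual update standing in for the block output (FFN omitted).
    nxt = [[a + b for a, b in zip(xr, mr)] for xr, mr in zip(x, mixed)]
    return nxt, inter, attn


def stec_like_forward(emb, block_params, head_params, w_out):
    x, interactions, attn_maps = emb, [], []
    for params in block_params:  # N stacked blocks, one interaction each
        x, inter, attn = interaction_and_update(x, params)
        interactions.append(inter)
        attn_maps.append(attn)
    # ASSUMPTION: one extra head on the final representation gives N + 1 levels.
    _, last_inter, last_attn = interaction_and_update(x, head_params)
    interactions.append(last_inter)
    attn_maps.append(last_attn)
    # Concatenation-based fusion of all levels into one flat vector.
    fused = [v for inter in interactions for row in inter for v in row]
    logit = dot(fused, w_out)  # linear layer standing in for the final MLP
    return 1.0 / (1.0 + math.exp(-logit)), interactions, attn_maps


rng = random.Random(0)
num_fields, dim, num_blocks = 4, 3, 2
emb = rand_matrix(num_fields, dim, rng)
block_params = [tuple(rand_matrix(dim, dim, rng) for _ in range(3))
                for _ in range(num_blocks)]
head_params = tuple(rand_matrix(dim, dim, rng) for _ in range(3))
w_out = [rng.gauss(0.0, 0.1) for _ in range((num_blocks + 1) * num_fields * dim)]

p_click, interactions, attn_maps = stec_like_forward(
    emb, block_params, head_params, w_out)
print(len(interactions), p_click)
```

Because the lower-level interactions are concatenated directly into the fused vector, they reach the prediction without passing through the later blocks. The returned `attn_maps` are field-by-field matrices of the kind that could be plotted as heat maps for inspection.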

Abstract

Click-Through Rate (CTR) prediction holds a pivotal place in online advertising and recommender systems, since CTR prediction performance directly influences both user satisfaction and the revenue generated by companies. Even so, CTR prediction is still an active area of research, since it involves accurately modelling the preferences of users based on sparse and high-dimensional features, where the higher-order interactions of multiple features can lead to different outcomes. Most CTR prediction models have relied on a single fusion and interaction learning strategy. The few CTR prediction models that have utilized multiple interaction modelling strategies have treated each interaction as self-contained. In this paper, we propose a novel model named STEC that reaps the benefits of multiple interaction learning approaches in a single unified architecture. Additionally, our model introduces residual connections from different orders of interactions, which boost performance by allowing lower-level interactions to directly affect the predictions. Through extensive experiments on four real-world datasets, we demonstrate that STEC outperforms existing state-of-the-art approaches for CTR prediction thanks to its greater expressive capabilities.

Paper Structure

This paper contains 25 sections, 12 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The overall STEC architecture uses $N$ stacked STEC blocks and fuses $N+1$ group bilinear interactions from different levels to form a single CTR prediction.
  • Figure 2: STEC outperforms other attention-based models in terms of AUC and logloss with lower FLOPs.
  • Figure 3: Heat maps of attention weights for three independent cases on Frappe. The tick labels correspond to the feature fields user, item, daytime, weekday, isweekend, homework, cost, weather, country, city.