STEC: See-Through Transformer-based Encoder for CTR Prediction

Serdarcan Dilbaz; Hasan Saribas

TL;DR

STEC is a transformer-inspired encoder for CTR prediction that unifies multiple interaction learning strategies by deriving bilinear interactions from a modified attention mechanism. By stacking $N$ STEC blocks and fusing $N+1$ bilinear interactions across levels, it achieves state-of-the-art or competitive results on four real-world datasets and in production, at lower computational cost than many attention-based baselines. The architecture comprises feature embeddings, a dedicated STEC block that exposes bilinear interactions, FFNs between blocks, and a concatenation-based fusion of multi-level interactions that feeds a final MLP. Attention weights provide interpretable rationales for recommendations, and ablation studies validate the importance of intermediate interactions and of the fusion strategy. Overall, STEC is a scalable, explainable CTR predictor with strong empirical performance and practical deployment potential.
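To make the counting in the summary concrete, here is a minimal pure-Python sketch of multi-level fusion: $N$ attention-style blocks each expose one vector of pairwise interactions, a final bilinear layer adds one more, and the resulting $N+1$ vectors are concatenated before a prediction head. This is an illustration under assumptions, not the paper's implementation. The names `toy_block`, `final_bilinear`, and `toy_forward` are hypothetical, the weights are random, the FFNs between blocks are omitted, a single linear layer stands in for the final MLP, and reading the interactions off the raw query-key scores is an assumed simplification of how a block might expose them.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def upper_pairs(matrix):
    """Flatten the entries (i, j) with i < j: one value per pair of fields."""
    f = len(matrix)
    return [matrix[i][j] for i in range(f) for j in range(i + 1, f)]


def toy_block(x, w_q, w_k, w_v):
    """Hypothetical stand-in for one block (assumption, not the paper's layer).

    Returns the updated field representations and a pairwise interaction vector.
    """
    d = len(x[0])
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # scores[i][j] = x_i W_q W_k^T x_j^T, a bilinear form in fields i and j.
    scores = matmul(q, transpose(k))
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # Residual connection around the attention output.
    out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, mixed)]
    return out, upper_pairs(scores)


def final_bilinear(x, w):
    """Pairwise bilinear interactions x_i W x_j^T on the last representations."""
    return upper_pairs(matmul(matmul(x, w), transpose(x)))


def toy_forward(x, n_blocks, rng):
    d = len(x[0])
    levels = []
    for _ in range(n_blocks):
        w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
        x, interactions = toy_block(x, w_q, w_k, w_v)
        levels.append(interactions)
    levels.append(final_bilinear(x, random_matrix(d, d, rng)))
    # Concatenation-based fusion of all N + 1 interaction levels.
    fused = [value for level in levels for value in level]
    # A single random linear layer plus sigmoid stands in for the final MLP.
    w_out = [rng.uniform(-0.5, 0.5) for _ in fused]
    logit = sum(w * v for w, v in zip(w_out, fused))
    prob = 1.0 / (1.0 + math.exp(-logit))
    return levels, fused, prob


rng = random.Random(0)
fields = random_matrix(4, 3, rng)  # 4 feature fields, embedding size 3
levels, fused, prob = toy_forward(fields, n_blocks=2, rng=rng)
print(len(levels), len(fused))  # prints "3 18"
```

With 2 blocks there are 3 interaction levels, and 4 fields give 6 field pairs per level, so the fused vector has 18 entries. Because every level is concatenated into the fused vector, lower-level interactions reach the prediction head directly rather than only through later blocks.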

Abstract

Click-Through Rate (CTR) prediction holds a pivotal place in online advertising and recommender systems, since CTR prediction performance directly influences both user satisfaction and company revenue. Even so, CTR prediction remains an active area of research, because it requires accurately modelling user preferences from sparse, high-dimensional features whose higher-order interactions can lead to different outcomes. Most CTR prediction models have relied on a single fusion and interaction learning strategy. The few that have used multiple interaction modelling strategies have treated each interaction as self-contained. In this paper, we propose a novel model named STEC that reaps the benefits of multiple interaction learning approaches in a single unified architecture. Additionally, our model introduces residual connections from different orders of interactions, which boost performance by allowing lower-level interactions to directly affect the predictions. Through extensive experiments on four real-world datasets, we demonstrate that STEC outperforms existing state-of-the-art approaches for CTR prediction thanks to its greater expressive capabilities.

Paper Structure (25 sections, 12 equations, 4 figures, 6 tables)

Introduction
Related Works
Bilinear Interaction
Attention Networks
Our Proposed Model
Feature Embedding
STEC Block
STEC Architecture
Position-wise Feed-Forward Networks
Final Bilinear Interaction Layer
Concatenation Layer
Experimental Results
Experiment Setup
Datasets
Training Objective
...and 10 more sections

Figures (4)

Figure 1: The overall STEC architecture uses $N$ stacked STEC blocks and fuses $N+1$ group bilinear interactions from different levels to form a single CTR prediction.
Figure 2: STEC outperforms other attention-based models in terms of AUC and logloss with lower FLOPs.
Figure 3: Heat maps of attention weights for three independent cases on Frappe. The tick labels correspond to the feature fields user, item, daytime, weekday, isweekend, homework, cost, weather, country, city.
