Feature Interaction Fusion Self-Distillation Network For CTR Prediction

Lei Sang, Qiuze Ru, Honghao Li, Yiwen Zhang, Qian Cao, Xindong Wu

TL;DR

This paper proposes FSDNet, a CTR prediction framework with a plug-and-play fusion self-distillation module that connects explicit and implicit feature interactions at each layer, improving information sharing between different features.

Abstract

Click-Through Rate (CTR) prediction plays a vital role in recommender systems, online advertising, and search engines. Most current approaches model feature interactions through stacked or parallel structures, and some employ knowledge distillation for model compression. However, we observe three limitations in these approaches: (1) In parallel-structure models, the explicit and implicit components run independently and simultaneously, which leads to insufficient information sharing within the feature set. (2) Knowledge distillation introduces a complex teacher-student framework design and suffers from low knowledge transfer efficiency. (3) Both the dataset and the process of constructing high-order feature interactions contain significant noise, which limits the model's effectiveness. To address these limitations, we propose FSDNet, a CTR prediction framework incorporating a plug-and-play fusion self-distillation module. Specifically, FSDNet forms connections between explicit and implicit feature interactions at each layer, enhancing the sharing of information between different features. The deepest fusion layer then serves as the teacher model, using self-distillation to guide the training of the shallower layers. Empirical evaluation across four benchmark datasets validates the framework's efficacy and generalization capabilities. The code is available at https://anonymous.4open.science/r/FSDNet.
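
The self-distillation idea in the abstract, where the deepest fusion layer acts as teacher for the shallower layers, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of temperature-softened sigmoid outputs, the averaging over shallow layers, and the way the balance weight `mu` and temperature `tau` enter the loss are all assumptions made for illustration. The summary above only indicates that the paper studies a loss balance $\mu$ and a temperature $\tau$.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a click probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy; `target` may be a hard label or a soft probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def self_distillation_loss(layer_logits, label, mu=0.5, tau=2.0):
    """Hypothetical per-sample loss for layer-wise self-distillation.

    layer_logits: one prediction logit per fusion layer, shallowest first.
    label:        ground-truth click label (0.0 or 1.0).
    mu:           assumed weight of the distillation term.
    tau:          assumed temperature used to soften the predictions.
    """
    *shallow, deepest = layer_logits

    # Supervised task loss on the deepest (teacher) layer.
    task_loss = bce(label, sigmoid(deepest))

    # The teacher's softened prediction is used as a fixed target.
    teacher_soft = sigmoid(deepest / tau)

    # Each shallow layer is pulled toward the teacher's softened prediction.
    distill = sum(bce(teacher_soft, sigmoid(z / tau)) for z in shallow)
    if shallow:
        distill /= len(shallow)

    return task_loss + mu * distill


if __name__ == "__main__":
    # Three fusion layers; the last logit comes from the deepest layer.
    print(self_distillation_loss([0.3, 1.2, 2.0], label=1.0))
```

In a real training loop the teacher target would typically be detached from the gradient computation so that only the shallow layers are updated by the distillation term; that detail is likewise an assumption here.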

Paper Structure

This paper contains 32 sections, 16 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: The architecture comparison among parallel and stacked structures.
  • Figure 2: The structural comparison of knowledge distillation and self-distillation.
  • Figure 3: The overall framework of FSDNet. The embedding layer maps sparse vectors to low-dimensional dense embeddings, while the cross network captures explicit feature interactions and the deep network models implicit feature relationships. The fusion self-distillation module enhances the framework's performance and generalization ability by connecting the outputs from both networks and using the predictions from the final layer to guide the learning of earlier layers. The linear-activation layer represents the simultaneous use of a linear layer and an activation function to generate prediction values, and $\otimes$ represents the cross operation in Eq. \ref{crossoperation}.
  • Figure 4: Hyperparameter study of Loss Balance $\mu$.
  • Figure 5: Hyperparameter study of Temperature $\tau$.
  • ...and 4 more figures