Mutual Learning for Finetuning Click-Through Rate Prediction Models

Ibrahim Can Yilmaz, Said Aldemir

TL;DR

This work investigates mutual learning as a finetuning strategy for CTR prediction. The motivation is that the performance gap between complex and simple CTR models is small, which makes one-way knowledge distillation hard to justify. It formalizes a mutual-learning objective in which each of $N$ networks minimizes its supervised BCE loss plus a mutual $L_{MSE}$ loss averaged over its peers, i.e., $L_{\theta_n} = L_{BCE_n} + \frac{1}{N - 1} \sum_{i \neq n} L_{MSE}(p_i, p_n)$ with $L_{MSE}(p_1, p_2) = (p_1 - p_2)^2$. Experiments on Criteo and Avazu with four baseline CTR models (DeepFM, DCN, PNN, FiBiNET) show that mutual learning from scratch does not help, but finetuning pretrained models with mutual learning yields consistent relative improvements of up to $0.66\%$ on Avazu and $0.37\%$ on Criteo; gains are larger when the accompanying models have higher pretrained scores. The results suggest that the benefits of mutual learning arise from aggregating diverse, non-model-specific knowledge, that the optimal co-training size is around four, and that the improvements hold across datasets and architectures.
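The objective in the TL;DR can be made concrete with a small per-example sketch. The function names (`bce`, `mse`, `mutual_losses`) are hypothetical and chosen only for illustration. The sketch evaluates the loss for a single example with a scalar prediction per network; how the loss is averaged over a batch, and whether peer predictions are treated as constants during backpropagation, are not specified in this summary and are left out here.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mse(p1, p2):
    """Mutual loss between two predictions: L_MSE(p1, p2) = (p1 - p2)^2."""
    return (p1 - p2) ** 2


def mutual_losses(preds, y):
    """Return one loss per network for a single example.

    preds: list of N predicted click probabilities, one per peer network.
    y:     ground-truth label (0 or 1).
    Implements L_n = BCE_n + 1/(N-1) * sum_{i != n} MSE(p_i, p_n).
    """
    n = len(preds)
    if n < 2:
        raise ValueError("mutual learning needs at least two networks")
    losses = []
    for k, p_k in enumerate(preds):
        peer_term = sum(
            mse(p_i, p_k) for i, p_i in enumerate(preds) if i != k
        ) / (n - 1)
        losses.append(bce(p_k, y) + peer_term)
    return losses


if __name__ == "__main__":
    # Three hypothetical peers predicting the same clicked example (y = 1).
    for idx, loss in enumerate(mutual_losses([0.8, 0.6, 0.4], 1)):
        print(f"network {idx}: loss = {loss:.4f}")
```

When all peers agree, the mutual term vanishes and each loss reduces to the plain BCE, so the extra term only penalizes disagreement among the networks.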

Abstract

Click-Through Rate (CTR) prediction has become an essential task in digital industries such as digital advertising and online shopping. Many deep learning-based methods have been developed and have become state-of-the-art models in the domain. To further improve the performance of CTR models, Knowledge Distillation-based approaches have been widely used. However, most current CTR prediction models do not have very complex architectures, so it is hard to call one of them 'cumbersome' and another 'tiny'. Moreover, the performance gap between complex and simple models is not very large, so distilling knowledge from one model to another may not be worth the effort. Under these considerations, Mutual Learning could be a better approach, since all the models can improve mutually. In this paper, we show how useful the mutual learning algorithm can be when it is applied between equals. In our experiments on the Criteo and Avazu datasets, the mutual learning algorithm improved model performance by up to 0.66% (relative).

Paper Structure

This paper contains 17 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Architecture of Knowledge Distillation algorithm.
  • Figure 2: Architecture of proposed algorithm.