Towards An Efficient LLM Training Paradigm for CTR Prediction
Allen Lin, Renqin Cai, Yun He, Hanchao Yu, Jing Qian, Rui Li, Qifan Wang, James Caverlee
TL;DR
This work tackles the high computational cost of training LLMs for CTR prediction under the sliding-window paradigm by introducing Dynamic Target Isolation (DTI), which uses streaming prompts to train multiple targets in parallel and windowed causal attention to keep complexity manageable. It shows that training time can be reduced by about 92% on three public CTR datasets with minimal or no loss in predictive performance, once two bottlenecks are addressed: hidden-state leakage and positional bias overfitting. The paper provides formal FLOPs reductions, experimental validation, and ablations, and proposes practical remedies (distance-based forgetting and ALiBi-based attention) that enable scaling to larger $k$. Overall, DTI offers a practical, efficient pathway for deploying LLM-based CTR models in large-scale, real-world settings, with clear guidance on prompt design and attention mechanisms.
Abstract
Large Language Models (LLMs) have demonstrated tremendous potential as next-generation ranking-based recommendation systems. Many recent works have shown that LLMs can significantly outperform conventional click-through-rate (CTR) prediction approaches. Despite such promising results, the computational inefficiency inherent in the current training paradigm makes it particularly challenging to train LLMs for ranking-based recommendation tasks on large datasets. To train LLMs for CTR prediction, most existing studies adopt the prevalent "sliding-window" paradigm. Given a sequence of $m$ user interactions, a unique training prompt is constructed for each interaction by designating it as the prediction target, with its preceding $n$ interactions serving as context. In turn, the sliding-window paradigm results in an overall complexity of $O(mn^2)$ that scales linearly with the length of the user interaction sequence. Consequently, directly adopting this strategy to train LLMs can result in prohibitively high training costs as the length of interactions grows. To alleviate the computational inefficiency, we propose a novel training paradigm, namely Dynamic Target Isolation (DTI), that structurally parallelizes the training of $k$ (where $k \gg 1$) target interactions. Furthermore, we identify two major bottlenecks, hidden-state leakage and positional bias overfitting, that limit DTI to scaling up only to a small value of $k$ (e.g., 5), and we propose a computationally light solution to effectively tackle each. Through extensive experiments on three widely adopted public CTR datasets, we empirically show that DTI reduces training time by an average of 92% (e.g., from $70.5$ hrs to $5.31$ hrs), without compromising CTR prediction performance.
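To make the prompt-count argument concrete, the following is a minimal sketch rather than the authors' implementation. It assumes interactions are plain list items, that only interactions with a full $n$-item history become targets, and that a streaming prompt places up to $k$ consecutive targets after a shared $n$-item context. All function names are hypothetical, and the cost helper simply evaluates the $m n^2$ term quoted in the abstract.

```python
from typing import List, Sequence, Tuple


def sliding_window_prompts(
    interactions: Sequence[int], n: int
) -> List[Tuple[List[int], int]]:
    """One (context, target) prompt per target interaction.

    Assumption: only interactions with n predecessors are used as targets.
    """
    return [
        (list(interactions[i - n:i]), interactions[i])
        for i in range(n, len(interactions))
    ]


def streaming_prompts(
    interactions: Sequence[int], n: int, k: int
) -> List[Tuple[List[int], List[int]]]:
    """Hypothetical streaming layout: n context items, then up to k targets.

    Each prompt covers k consecutive targets, so roughly k times fewer
    prompts are needed to train on the same set of targets.
    """
    prompts = []
    for start in range(n, len(interactions), k):
        context = list(interactions[start - n:start])
        targets = list(interactions[start:start + k])
        prompts.append((context, targets))
    return prompts


def sliding_window_cost(m: int, n: int) -> int:
    """The m * n^2 term from the abstract, in arbitrary units."""
    return m * n * n


if __name__ == "__main__":
    history = list(range(10))  # ten interaction ids, oldest first
    sw = sliding_window_prompts(history, n=3)
    st = streaming_prompts(history, n=3, k=4)
    print(len(sw), "sliding-window prompts;", len(st), "streaming prompts")
```

Both layouts cover the same targets; the streaming layout simply needs fewer prompts. In this sketch the first target in each streaming prompt sees exactly the $n$-item context, while later targets in the same prompt would be restricted to their own preceding $n$ items by the windowed causal attention mentioned in the TL;DR, which is not modeled here.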
