HoMer: Addressing Heterogeneities by Modeling Sequential and Set-wise Contexts for CTR Prediction

Shuwei Chen, Jiajun Cui, Zhengqi Xu, Fan Zhang, Jiangke Fan, Teng Zhang, Xingxing Wang

TL;DR

HoMer addresses three forms of heterogeneity in CTR prediction by unifying panoramic sequence modeling with set-wise cross-item interactions in a homogeneous Transformer. The panoramic sequence aligns rich non-sequential features with history to produce fine-grained user interest, while the set-wise decoder captures cross-item and user-item interactions across the entire exposed item set in a single model invocation. Empirical results on Meituan’s search ads show an AUC improvement of about 0.0099 over industrial baselines, plus online CTR and RPM boosts of ~1.99% and ~2.46%, respectively, and a 27% reduction in GPU usage due to kernel fusion and shared computations. The work demonstrates strong offline and online performance, practical deployment efficiency, and scalability for large-scale industrial recommender systems.

Abstract

Click-through rate (CTR) prediction, which models behavior sequence and non-sequential features (e.g., user/item profiles or cross features) to infer user interest, underpins industrial recommender systems. However, most methods face three forms of heterogeneity that degrade predictive performance: (i) Feature Heterogeneity persists when limited sequence side features provide less granular interest representation compared to extensive non-sequential features, thereby impairing sequence modeling performance; (ii) Context Heterogeneity arises because a user's interest in an item will be influenced by other items, yet point-wise prediction neglects cross-item interaction context from the entire item set; (iii) Architecture Heterogeneity stems from the fragmented integration of specialized network modules, which constrains the model's effectiveness, efficiency and scalability in industrial deployments. To tackle the above limitations, we propose HoMer, a Homogeneous-Oriented TransforMer for modeling sequential and set-wise contexts. First, we align sequence side features with non-sequential features for accurate sequence modeling and fine-grained interest representation. Second, we shift the prediction paradigm from point-wise to set-wise, facilitating cross-item interaction in a highly parallel manner. Third, HoMer's unified encoder-decoder architecture achieves dual optimization through structural simplification and shared computation, ensuring computational efficiency while maintaining scalability with model size. Without arduous modification to the prediction pipeline, HoMer successfully scales up and outperforms our industrial baseline by 0.0099 in the AUC metric, and enhances online business metrics like CTR/RPM by 1.99%/2.46%. Additionally, HoMer saves 27% of GPU resources via preliminary engineering optimization, further validating its superiority and practicality.
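
To make the shift from point-wise to set-wise prediction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head with no learned projections, a plain additive fusion, and a linear-plus-sigmoid output head; the names `attention`, `score_set`, `user_states`, and `w_out` are hypothetical and chosen only for illustration. Each candidate item attends to the other candidates (cross-item context) and to already-encoded user sequence states (user-item context), and the whole candidate set is scored in one call.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    # Single-head scaled dot-product attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def score_set(user_states, item_embs, w_out):
    # Cross-item context: every candidate attends to all candidates in the set.
    cross_item = attention(item_embs, item_embs, item_embs)
    # User-item context: every candidate attends to the encoded user sequence.
    user_item = attention(item_embs, user_states, user_states)
    scores = []
    for item, ci, ui in zip(item_embs, cross_item, user_item):
        # Assumption: simple additive fusion followed by a linear + sigmoid head.
        fused = [a + b + c for a, b, c in zip(item, ci, ui)]
        scores.append(1.0 / (1.0 + math.exp(-dot(fused, w_out))))
    return scores


# Toy inputs (hypothetical values, 2-dimensional embeddings).
user_states = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_out = [0.5, -0.5]

# Set-wise: one invocation scores all candidates with shared context.
set_scores = score_set(user_states, candidates, w_out)
# Point-wise: each candidate is scored alone, so cross-item context is lost.
point_scores = [score_set(user_states, [c], w_out)[0] for c in candidates]
```

In this toy setup the score of a candidate depends on which other items appear in the set, which is the context that a point-wise call cannot see. The user sequence states are computed once and reused for every candidate, which loosely mirrors the shared computation the abstract describes.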

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustrations of heterogeneities in the traditional CTR prediction paradigm. (a) Feature Heterogeneity: The misalignment between sequence side features and non-sequential features produces coarse-grained user interest representation. (b) Context Heterogeneity: In the point-wise prediction paradigm, the neglect of cross-item interaction context from the entire item set limits the model's capability to capture authentic user behavior patterns. (c) Architecture Heterogeneity: The fragmented integration of specialized network modules constrains the model's effectiveness, efficiency and scalability.
  • Figure 2: Comparison of feature schemas between point-wise prediction and HoMer's set-wise prediction.
  • Figure 3: The overall architecture of HoMer. The sequential encoder models fine-grained user interest representation from the panoramic sequence, and the set-wise decoder captures cross-item interaction context from the features of the entire item set.
  • Figure 4: Scalability of models with respect to FLOPs.
  • Figure 5: Ablation of cross-item interaction modeling.
  • ...and 1 more figure