Quadratic Interest Network for Multimodal Click-Through Rate Prediction

Honghao Li, Hanwei Li, Jing Zhang, Yi Zhang, Ziniu Yu, Lei Sang, Yiwen Zhang

TL;DR

The paper tackles multimodal CTR prediction under strict latency requirements. It introduces Quadratic Interest Network (QIN), composed of Adaptive Sparse Target Attention (ASTA) for dynamic, sparse user-behavior extraction and a Quadratic Neural Network (QNN) for explicit high-order feature interactions. Empirical results show that QIN achieves strong performance (AUC up to around 0.97 on validation and 0.9798 on the leaderboard) and that both ASTA and QNN contribute significantly, as shown by comprehensive ablations. The work demonstrates practical viability for production-grade, multimodal recommender systems and provides open-source resources for replication.

Abstract

Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system's understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem as two tasks: (1) Task 1, Multimodal Item Embedding, which explores multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2, Multimodal CTR Prediction, which explores which multimodal recommendation models can effectively leverage multimodal embedding features to achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at https://github.com/salmon1802/QIN.
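
To make the idea of sparse target attention more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the candidate (target) item embedding acts as the query over the user's behavior embeddings, and that sparsity is obtained by applying ReLU to the scaled dot-product scores in place of softmax. Both choices are assumptions made for illustration, and the function name `sparse_target_attention` is hypothetical; the abstract does not specify how ASTA is parameterized.

```python
import numpy as np

def sparse_target_attention(target, behaviors):
    """Pool behavior embeddings with non-negative, possibly zero weights.

    target:    (d,)   embedding of the candidate item (the query)
    behaviors: (T, d) embeddings of the user's past behaviors (keys and values)

    Assumption: ReLU replaces softmax, so behaviors whose score is not
    positive receive exactly zero weight. This is an illustrative choice,
    not a confirmed detail of ASTA.
    """
    d = target.shape[-1]
    scores = behaviors @ target / np.sqrt(d)   # (T,) scaled dot products
    weights = np.maximum(scores, 0.0)          # ReLU zeroes out non-positive scores
    pooled = weights @ behaviors               # (d,) weighted sum of behaviors
    return pooled, weights

# Toy example: only the first behavior is aligned with the target.
target = np.array([1.0, 0.0])
behaviors = np.array([[2.0, 0.0],
                      [-1.0, 1.0],
                      [0.0, 3.0]])
pooled, weights = sparse_target_attention(target, behaviors)
```

In this toy case the second and third behaviors receive zero weight, so the pooled interest vector depends only on the first behavior. A softmax would instead assign every behavior a strictly positive weight.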

Paper Structure

This paper contains 10 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The architecture of Quadratic Interest Network.
  • Figure 2: Comparison of MLP and QNN. The input to the MLP consists of raw features, while the QNN uses linearly independent quadratic polynomials as input.
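
The contrast drawn in the Figure 2 caption can be illustrated with a short sketch. It assumes that "linearly independent quadratic polynomials" can be represented by the d(d+1)/2 distinct monomials x_i * x_j with i <= j; the paper's actual QNN may construct or parameterize these terms differently. The helper name `quadratic_features` is hypothetical.

```python
import numpy as np

def quadratic_features(x):
    """Return the distinct second-order monomials x_i * x_j (i <= j) per row.

    x: (n, d) batch of raw feature vectors.
    Output: (n, d * (d + 1) // 2). Using only i <= j drops the duplicate
    x_j * x_i terms, so the monomials are linearly independent as polynomials.
    """
    x = np.atleast_2d(x)
    d = x.shape[1]
    i, j = np.triu_indices(d)      # index pairs of the upper triangle, i <= j
    return x[:, i] * x[:, j]

x = np.array([[1.0, 2.0, 3.0]])    # raw features, as an MLP would receive them
q = quadratic_features(x)          # explicit quadratic terms, as a QNN-style input
```

For d = 3 this yields six terms: x1², x1·x2, x1·x3, x2², x2·x3, x3². A layer applied to `q` therefore sees second-order interactions explicitly, whereas an MLP applied to `x` has to learn them implicitly.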