Decoupled Multimodal Fusion for User Interest Modeling in Click-Through Rate Prediction

Alin Fan, Hanqing Li, Sihan Lu, Jingsong Yuan, Jiandong Zhang

TL;DR

This work tackles the mismatch between sparse ID-based signals and rich multimodal content in CTR prediction. It introduces Decoupled Multimodal Fusion (DMF), which combines a modality-enriched pathway (via Decoupled Target Attention) with a modality-centric pathway (via similarity histograms) through Complementary Modality Modeling, yielding strong offline gains and real-world business impact. The approach achieves efficient inference by decoupling target-aware computations from reusable target-agnostic components and demonstrates robust improvements on public and Lazada industrial data, including substantial online metrics. Overall, DMF provides a practical, scalable solution for integrating multimodal signals into industrial recommender systems with improved user-interest modeling and click-through performance.

Abstract

Modern industrial recommendation systems improve recommendation performance by integrating multimodal representations from pre-trained models into ID-based Click-Through Rate (CTR) prediction frameworks. However, existing approaches typically adopt modality-centric modeling strategies that process ID-based and multimodal embeddings independently, failing to capture fine-grained interactions between content semantics and behavioral signals. In this paper, we propose Decoupled Multimodal Fusion (DMF), which introduces a modality-enriched modeling strategy to enable fine-grained interactions between ID-based collaborative representations and multimodal representations for user interest modeling. Specifically, we construct target-aware features to bridge the semantic gap across different embedding spaces and leverage them as side information to enhance the effectiveness of user interest modeling. Furthermore, we design an inference-optimized attention mechanism that decouples the computation of target-aware features and ID-based embeddings before the attention layer, thereby alleviating the computational bottleneck introduced by incorporating target-aware features. To achieve comprehensive multimodal integration, DMF combines user interest representations learned under the modality-centric and modality-enriched modeling strategies. Offline experiments on public and industrial datasets demonstrate the effectiveness of DMF. Moreover, DMF has been deployed on the product recommendation system of the international e-commerce platform Lazada, achieving relative improvements of 5.30% in CTCVR and 7.43% in GMV with negligible computational overhead.
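
The decoupling idea in the abstract can be illustrated with a small linear-algebra observation. The sketch below is not the paper's attention mechanism; it only assumes that attention keys come from a linear projection of the concatenated ID-based and target-aware embeddings. Under that assumption the projection splits into two terms, so the ID-based term can be computed once per user and reused for every candidate target item. All names and sizes (`W_id`, `W_side`, `keys_early_fusion`, `keys_decoupled`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_id, d_side, d_k = 5, 8, 4, 6  # hypothetical: history length and embedding sizes

# Hypothetical projection weights for the ID-based and target-aware parts.
W_id = rng.normal(size=(d_id, d_k))
W_side = rng.normal(size=(d_side, d_k))


def keys_early_fusion(id_emb, side_emb):
    """Project the concatenated [ID | side] features; redone in full per target."""
    fused = np.concatenate([id_emb, side_emb], axis=1)   # (T, d_id + d_side)
    weights = np.concatenate([W_id, W_side], axis=0)     # (d_id + d_side, d_k)
    return fused @ weights


def keys_decoupled(cached_id_keys, side_emb):
    """Reuse the cached ID projection; only the side term depends on the target."""
    return cached_id_keys + side_emb @ W_side


id_emb = rng.normal(size=(T, d_id))
cached_id_keys = id_emb @ W_id  # target-agnostic: computed once per user

for _ in range(3):  # three hypothetical candidate target items
    side_emb = rng.normal(size=(T, d_side))  # target-aware: changes per candidate
    early = keys_early_fusion(id_emb, side_emb)
    decoupled = keys_decoupled(cached_id_keys, side_emb)
    assert np.allclose(early, decoupled)
```

Both paths produce the same keys, but the decoupled path performs only the `side_emb @ W_side` product per candidate, which is where the saving would come from when many target items are scored for one user.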

Paper Structure

This paper contains 27 sections, 19 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Modality-centric Modeling: ID-based embeddings and multimodal representations are encoded independently, without fine-grained interaction during user interest modeling. (b) Modality-enriched Modeling: Multimodal representations are integrated as side information into the interest encoder, enabling fine-grained interaction between content semantics and behavioral signals during user interest modeling. (c) Complementary Modality Modeling (CMM): A structured fusion strategy that combines modality-centric and modality-enriched pathways. This hybrid approach produces a comprehensive user interest representation balancing semantic generalization and fine-grained behavioral personalization.
  • Figure 2: The framework of DMF. Multimodal representations are used to compute similarity scores between the target item and each historically interacted item. These scores are discretized and mapped into learnable embeddings as input to the DTA module, while SH (Similarity Histogram) operates directly on the raw similarity scores. Mod-Centric and Mod-Enriched denote the Modality-Centric and Modality-Enriched modeling strategies, respectively. The resulting user interest representations are fused by the CMM module and combined with the target item for CTR prediction.
  • Figure 3: Illustration of target-agnostic and target-aware node computation in a CTR model. The left panel shows target-agnostic features derived from the user interaction sequence, which are computed once and reused across all target items. The right panel shows target-aware features, where each target item triggers a distinct computation path for its associated historical interactions. The output of target-agnostic nodes is reusable, while target-aware nodes require recomputation per target item.
  • Figure 4: Comparison of side information fusion methods based on target-aware attention: (a) early fusion, (b) late fusion, and (c) decoupled fusion, which balances efficiency and effectiveness.
  • Figure 5: Performance for varying values of the representation-aggregation hyperparameter $\alpha$. When $\alpha=0$, only the modality-centric modeling strategy is used; when $\alpha=1$, only the modality-enriched modeling strategy is used.
  • ...and 2 more figures
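
The captions of Figures 2 and 5 describe a small pipeline: similarity scores between the target item and each history item, a discretization step that feeds learnable embeddings, a histogram over the raw scores, and an $\alpha$-weighted combination of the two interest representations. A minimal sketch of those steps follows. It assumes cosine similarity, uniform buckets over [-1, 1], and a convex combination for the fusion; none of these choices is confirmed by the captions, and every function name here is hypothetical.

```python
import numpy as np


def cosine_similarity(target, history):
    """Cosine similarity between a target vector (d,) and history vectors (T, d)."""
    target = target / np.linalg.norm(target)
    history = history / np.linalg.norm(history, axis=1, keepdims=True)
    return history @ target  # shape (T,), values in [-1, 1]


def bucketize(scores, num_buckets):
    """Map scores in [-1, 1] to integer bucket ids in [0, num_buckets - 1]."""
    interior_edges = np.linspace(-1.0, 1.0, num_buckets + 1)[1:-1]
    return np.digitize(scores, interior_edges)


def similarity_histogram(scores, num_buckets):
    """Fraction of history items falling into each similarity bucket."""
    counts = np.bincount(bucketize(scores, num_buckets), minlength=num_buckets)
    return counts / max(len(scores), 1)


def fuse(centric, enriched, alpha):
    """Assumed convex combination: alpha=0 keeps only the modality-centric
    representation, alpha=1 only the modality-enriched one."""
    return (1.0 - alpha) * centric + alpha * enriched


# Toy usage with made-up 2-d multimodal vectors.
target = np.array([1.0, 0.0])
history = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
scores = cosine_similarity(target, history)

bucket_ids = bucketize(scores, num_buckets=4)
embedding_table = np.random.default_rng(0).normal(size=(4, 8))  # stands in for learnable embeddings
side_info = embedding_table[bucket_ids]  # (T, 8), one embedding per history item

histogram = similarity_histogram(scores, num_buckets=4)
```

In this sketch the bucket embeddings play the role of the target-aware side information passed to the attention module, while the histogram summarizes the same scores without any per-item embedding lookup.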