Modality-Balanced Learning for Multimedia Recommendation

Jinghao Zhang, Guofan Liu, Qiang Liu, Shu Wu, Liang Wang

TL;DR

This work tackles the modal imbalance problem in multimodal recommendation by introducing Counterfactual Knowledge Distillation (CKD), a plug-and-play framework in which modality-specific uni-modal teachers guide a multimodal student through two distillation losses (specific and generic). It also uses counterfactual inference to estimate each modality’s effect on the training objective and re-weights the distillation losses accordingly, which mitigates under-optimization of weaker modalities. Across four Amazon datasets and six backbones, CKD yields substantial improvements in Recall@20 and other metrics, and ablation studies confirm the necessity of each component (uni-modal teachers, hinge-based specific distillation, generic distillation, and re-weighting). The approach is model-agnostic and can augment both late-fusion and early-fusion architectures.

Abstract

Many recommender models have been proposed to investigate how to effectively incorporate multimodal content information into the traditional collaborative filtering framework. The use of multimodal information is expected to provide more comprehensive information and lead to superior performance. However, the integration of multiple modalities often encounters the modal imbalance problem: since the information in different modalities is unbalanced, optimizing the same objective across all modalities leads to under-optimization of the weak modalities, which show a slower convergence rate or lower performance. Even worse, we find that in multimodal recommendation models, all modalities suffer from insufficient optimization. To address these issues, we propose a Counterfactual Knowledge Distillation method that can solve the imbalance problem and make the best use of all modalities. Through modality-specific knowledge distillation, it guides the multimodal model to learn modality-specific knowledge from uni-modal teachers. We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers. Additionally, to adaptively recalibrate the focus of the multimodal model towards weaker modalities during training, we estimate the causal effect of each modality on the training objective using counterfactual inference techniques, through which we can determine the weak modalities, quantify the degree of imbalance, and re-weight the distillation loss accordingly. Our method can serve as a plug-and-play module for both late-fusion and early-fusion backbones. Extensive experiments on six backbones show that our proposed method improves performance by a large margin. The source code will be released at https://github.com/CRIPAC-DIG/Balanced-Multimodal-Rec
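
To make the re-weighting idea in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a modality's effect is measured as the rise in training loss when that modality is masked out (the counterfactual), that weights come from a softmax over negated effects so weaker modalities receive more distillation weight, that the specific loss is a hinge on pairwise score margins, and that the generic loss is a KL divergence between softmaxed teacher and student scores. All function names, the `temperature` parameter, and these exact loss forms are hypothetical choices for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def modality_weights(factual_loss, counterfactual_losses, temperature=1.0):
    """Assumed re-weighting rule (illustrative only).

    counterfactual_losses maps modality name -> loss with that modality masked.
    A small loss increase means the model relies little on that modality,
    so it is treated as weak and given a larger weight.
    """
    names = list(counterfactual_losses)
    effects = [counterfactual_losses[m] - factual_loss for m in names]
    weights = softmax([-e for e in effects], temperature)
    return dict(zip(names, weights))


def specific_hinge_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Assumed hinge form: where the teacher ranks the positive item above
    the negative one, push the student's margin to at least the teacher's."""
    teacher_margin = teacher_pos - teacher_neg
    if teacher_margin <= 0:
        return 0.0  # the teacher is wrong on this pair; nothing to distill
    return max(0.0, teacher_margin - (student_pos - student_neg))


def generic_kl_loss(student_scores, teacher_scores):
    """Assumed generic form: KL(teacher || student) over softmaxed scores."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def total_loss(task_loss, weights, specific, generic):
    """Task loss plus per-modality distillation terms scaled by the weights."""
    return task_loss + sum(
        weights[m] * (specific[m] + generic[m]) for m in weights
    )


# Toy numbers: masking the visual modality hurts much more than masking text,
# so under the assumptions above the textual modality is treated as weaker.
weights = modality_weights(0.5, {"visual": 0.9, "textual": 0.55})
```

In this toy setup the textual modality receives the larger weight, so its teacher's distillation terms contribute more to `total_loss`. The softmax is one convenient way to turn effects into normalized weights; a different normalization would fit the same description.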

Paper Structure (27 sections, 19 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: A pilot study of different model variants on Amazon-Clothing. The shaded area indicates the degree of under-optimization of each modality (best viewed in color). Because early stopping is used, training terminates at different steps, so the curves have different lengths.
  • Figure 2: An illustration of the CKD model architecture.