Multimodality Invariant Learning for Multimedia-Based New Item Recommendation

Haoyue Bai; Le Wu; Min Hou; Miaomiao Cai; Zhuangzhuang He; Yuyang Zhou; Richang Hong; Meng Wang

TL;DR

This paper tackles the challenge of recommending newly added items in multimedia contexts when modalities may be incomplete. It introduces MILK, a two-module framework combining Cross-Modality Alignment with Cross-Environment Invariant learning, and uses cyclic mixup to create diverse, heterogeneous environments sampled via a Dirichlet distribution to simulate arbitrary modality missingness. By enforcing invariant user-content preferences across these environments, MILK achieves robust performance gains over state-of-the-art baselines on three real-world datasets, especially under missing modalities. The approach offers practical impact for fast-adapting recommender systems operating in real-world, multimodal ecosystems where modality completeness cannot be guaranteed.

Abstract

Multimedia-based recommendation provides personalized item suggestions by learning users' content preferences. With the proliferation of digital devices and apps, a huge number of new items are created rapidly over time. Quickly providing recommendations for new items at inference time is challenging. Worse, real-world items exhibit varying degrees of missing modalities (e.g., many short videos are uploaded without text descriptions). Although many efforts have been devoted to multimedia-based recommendation, existing methods either cannot handle new multimedia items or assume modality completeness in the modeling process. In this paper, we highlight the necessity of tackling the missing-modality issue for new item recommendation. We argue that users' inherent content preference is stable and should be kept invariant across arbitrary modality-missing environments. Therefore, we approach this problem from a novel perspective of invariant learning. However, constructing environments from finite user behavior training data that generalize to any pattern of missing modalities is challenging. To tackle this issue, we propose a novel Multimodality Invariant Learning reCommendation (a.k.a. MILK) framework. Specifically, MILK first designs a cross-modality alignment module to keep semantic consistency from pretrained multimedia item features. After that, MILK designs multi-modal heterogeneous environments with cyclic mixup to augment the training data, mimicking arbitrary missing modalities for invariant user preference learning. Extensive experiments on three real datasets verify the superiority of our proposed framework. The code is available at https://github.com/HaoyueBai98/MILK.
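
To make the environment-construction idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that one set of modality weights is drawn from a Dirichlet distribution and then rotated cyclically so that every modality receives every weight once, with each rotation treated as one environment. The function names (`sample_dirichlet`, `cyclic_environments`, `fuse`), the concentration parameters, and the toy vectors are all hypothetical; the exact cyclic mixup formulation is not given in this summary.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def cyclic_environments(weights):
    """Assumed 'cyclic' step: rotate the weight vector once per modality."""
    return [weights[k:] + weights[:k] for k in range(len(weights))]


def fuse(modality_vectors, env_weights):
    """Weighted sum of per-modality vectors, giving one item representation."""
    dim = len(modality_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(env_weights, modality_vectors))
        for i in range(dim)
    ]


rng = random.Random(0)                       # fixed seed for repeatability
weights = sample_dirichlet([1.0, 1.0], rng)  # two modalities, e.g. image and text
envs = cyclic_environments(weights)          # one weight set per environment

# Toy modality representations (hypothetical values).
image_vec, text_vec = [1.0, 0.0], [0.0, 1.0]
item_reps = [fuse([image_vec, text_vec], w) for w in envs]
```

Under this assumption, a weight close to zero for one modality approximates an item whose modality is missing, so sampling many weight sets exposes the model to a range of missingness patterns during training.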

Paper Structure (27 sections, 18 equations, 6 figures, 4 tables)

This paper contains 27 sections, 18 equations, 6 figures, 4 tables.

Introduction
Related Work
Multimedia-Based Recommendation
Invariant Learning for Recommendation
Problem Formulation
New Item Recommendation
Modality Missing Issue
The Proposed MILK Framework
Overview of MILK
Cross-Modality Alignment Module
Cross-Environment Invariant Module
Model Optimization and Inference
Experiments
Experimental Settings
Datasets
...and 12 more sections

Figures (6)

Figure 1: New Item Recommendation with Missing Modalities
Figure 2: Model overview. MILK consists of a Cross-Modality Alignment Module (CMAM) and a Cross-Environment Invariant Module (CEIM). CMAM obtains the modality representations $\mathbf{c}^m$ through independent feature extractors $\mathcal{G}^m$ and then imposes alignment between any two modalities. CEIM converts the user ID into a user representation via the embedding function $\mathcal{P}$ and generates item representations through the fusion functions $\mathcal{Q}$. CEIM generates multiple sets of weights as heterogeneous environments through cyclic mixup and aggregates the multi-modal representations into item representations $\mathbf{z}^{e}_{j}$ in each environment $e$. Finally, CEIM optimizes the model under the invariant learning paradigm.
Figure 3: Performance on different missing scenarios.
Figure 4: Ablation experiments on Baby and Clothing datasets.
Figure 5: Effect of different modules on the robustness of the model.
...and 1 more figure
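
The final step in the Figure 2 caption, optimizing under the invariant learning paradigm, can be illustrated with a generic sketch. The version below uses a variance-of-risks penalty in the style of risk extrapolation: the mean of the per-environment losses plus a weighted variance term that discourages the loss from differing across environments. This is a swapped-in, commonly used invariance objective and not necessarily the one MILK uses; `invariant_objective`, `penalty_weight`, and the loss values are hypothetical.

```python
def invariant_objective(env_losses, penalty_weight):
    """Mean per-environment loss plus a variance penalty across environments.

    Assumption: a generic variance-of-risks objective, used here only to
    illustrate the idea of invariance across environments.
    """
    n = len(env_losses)
    mean_loss = sum(env_losses) / n
    variance = sum((loss - mean_loss) ** 2 for loss in env_losses) / n
    return mean_loss + penalty_weight * variance


# Equal losses in every environment incur no penalty.
balanced = invariant_objective([0.5, 0.5], penalty_weight=10.0)

# Unequal losses with the same mean are penalized.
unbalanced = invariant_objective([0.2, 0.8], penalty_weight=1.0)
```

With equal losses the penalty vanishes and the objective reduces to the mean loss, so the extra term only acts when the model performs unevenly across environments.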
