Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

Han Liu; Yinwei Wei; Fan Liu; Wenjie Wang; Liqiang Nie; Tat-Seng Chua

TL;DR

This work addresses the limitations of static multimodal fusion in micro-video recommendation by introducing MetaMMF, a meta-learning framework that generates item-specific fusion parameters for each micro-video. It comprises a meta information extractor and a meta fusion learner that together produce adaptive fusion weights, and it can optionally apply canonical polyadic (CP) decomposition to reduce the parameter count and improve training efficiency. MetaMMF can be integrated with matrix factorization (MF) or graph convolutional network (GCN) backbones (MetaMMF_MF, MetaMMF_GCN). The authors report improvements over state-of-the-art baselines on MovieLens, TikTok, and Kwai, along with benefits in representation learning and convergence. The approach enables dynamic, per-item fusion of visual, acoustic, and textual modalities and may be applicable to other multimodal tasks.

Abstract

Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of micro-video, multimodal fusion plays a vital role in the existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models, like MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve the training efficiency, and validate its effectiveness through experimental results. Codes are available at https://github.com/hanliu95/MetaMMF.
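The item-specific fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single fusion layer, a hypothetical meta extractor made of one linear map followed by `tanh`, and a 3-D tensor `T` that turns the meta information into a per-item weight matrix. All names (`meta_extractor`, `dynamic_fusion`, `W_meta`, `T`), the dimensions, and the random initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not from the paper).
D_V, D_A, D_T = 8, 4, 6        # visual, acoustic, textual feature dims
D_IN = D_V + D_A + D_T         # concatenated multimodal feature dim
D_META = 5                     # meta-information dim
D_OUT = 3                      # fused representation dim

# Parameters shared across all items (these are what training would update).
W_meta = rng.normal(size=(D_META, D_IN)) * 0.1      # assumed meta extractor
T = rng.normal(size=(D_OUT, D_IN, D_META)) * 0.1    # assumed meta learner tensor


def meta_extractor(x):
    """Compress the concatenated modality features into meta information."""
    return np.tanh(W_meta @ x)


def dynamic_fusion(x_v, x_a, x_t):
    """Fuse one item's modalities using weights generated for that item."""
    x = np.concatenate([x_v, x_a, x_t])
    u = meta_extractor(x)
    # Contract the shared tensor with the meta information to obtain an
    # item-specific weight matrix of shape (D_OUT, D_IN).
    W_i = np.einsum("oim,m->oi", T, u)
    return W_i @ x, W_i


# Two items with different features receive different fusion weights.
e_i, W_i = dynamic_fusion(np.ones(D_V), np.ones(D_A), np.ones(D_T))
e_j, W_j = dynamic_fusion(2 * np.ones(D_V), np.ones(D_A), np.ones(D_T))
```

In a static fusion scheme, `W_i` and `W_j` would be the same learned matrix; here they differ because each is generated from the item's own meta information.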

Paper Structure (32 sections, 19 equations, 6 figures, 5 tables)

Introduction
Related Work
Micro-Video Recommendation
Multimodal Fusion
Meta-Learning in Recommendation
Preliminaries
Micro-Video Recommendation
Multimodal Fusion for Multimodal Representation Learning
Problem Set-Up of Dynamic Multimodal Fusion
Method
General One-Layer Fusion Framework
Meta Information Extractor
Meta Fusion Learner
Deep Multi-Layer Fusion Framework
Model Simplification
...and 17 more sections

Figures (6)

Figure 1: Illustration of the static and the dynamic multimodal fusion. $\mathbf{x}_i^v$, $\mathbf{x}_i^a$, and $\mathbf{x}_i^t$ denote the features derived from visual, acoustic, and textual modalities of micro-video $i$, respectively. $\mathbf{e}_i^m$ denotes the multimodal representation of micro-video $i$ through the fusion function. $\Theta_i$ denotes the item-specific parameters dynamically generated for the fusion function by a meta-learning algorithm. Similar notations are used for denoting the attributes of micro-video $j$.
Figure 2: Schematic illustration of our proposed model.
Figure 3: CP decomposition of a 3-D tensor.
Figure 4: The visualization depicts the t-SNE transformed representations obtained from our methods and baselines. Each star corresponds to a user from the TikTok dataset, while points with the same color signify relevant items. A link between a star and a point represents their interaction. For optimal viewing, please refer to the colored version. The notation $S$ indicates the mean silhouette coefficient of the clustering result for the sample representations.
Figure 5: Effect of hyper-parameter $R$ in CP decomposition.
...and 1 more figure
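Figures 3 and 5 refer to CP decomposition of a 3-D tensor and its hyper-parameter $R$. As a rough sketch of the general technique (not the paper's code), the NumPy example below builds a rank-$R$ tensor from three factor matrices, contracts it with a vector without materializing the full tensor, and compares parameter counts. The sizes and the names `A`, `B`, `C`, `cp_reconstruct`, and `contract_with_vector` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper); R is the CP rank.
D_OUT, D_IN, D_META, R = 3, 18, 5, 2

# Three factor matrices stand in for a full D_OUT x D_IN x D_META tensor.
A = rng.normal(size=(D_OUT, R))
B = rng.normal(size=(D_IN, R))
C = rng.normal(size=(D_META, R))


def cp_reconstruct(A, B, C):
    """Sum of R rank-one tensors: T[o, i, m] = sum_r A[o, r] B[i, r] C[m, r]."""
    return np.einsum("or,ir,mr->oim", A, B, C)


def contract_with_vector(A, B, C, u):
    """Compute sum_m T[:, :, m] * u[m] directly from the factors."""
    return (A * (C.T @ u)) @ B.T


u = rng.normal(size=D_META)
W_full = np.einsum("oim,m->oi", cp_reconstruct(A, B, C), u)
W_fact = contract_with_vector(A, B, C, u)

full_params = D_OUT * D_IN * D_META        # parameters in the dense tensor
cp_params = R * (D_OUT + D_IN + D_META)    # parameters in the three factors
```

With these sizes the dense tensor holds 270 parameters and the factors hold 52. Larger $R$ raises the factor count linearly, which is the trade-off a rank hyper-parameter controls.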
