AlignRec: Aligning and Training in Multimodal Recommendations

Yifan Liu; Kangning Zhang; Xiangyuan Ren; Yanhua Huang; Jiarui Jin; Yingjie Qin; Ruilong Su; Ruiwen Xu; Yong Yu; Weinan Zhang

TL;DR

AlignRec tackles the misalignment between multimodal content and ID-based features in recommender systems by decomposing the objective into three alignments: Inter-Content Alignment (ICA), Content-Category Alignment (CCA), and User-Item Alignment (UIA). It introduces a three-module architecture (MMEnc, Aggregator, Fuser) and a two-stage training protocol (ICA pre-training followed by joint optimization of CCA/UIA with BPR), complemented by three intermediate evaluation schemes to assess multimodal feature quality. Empirical results on three real-world Amazon domains show state-of-the-art performance and demonstrate that the learned multimodal features are smaller and more effective than prior pre-extracted features. The work also provides extensive ablations, hyper-parameter analyses, and long-tail item findings, highlighting AlignRec’s practical impact for robust, scalable multimodal recommendations and its potential for open-source adoption.

Abstract

With the development of multimedia systems, multimodal recommendations are playing an essential role, as they can leverage rich contexts beyond interactions. Existing methods mainly treat multimodal information as auxiliary, using it to help learn ID features; however, semantic gaps exist between multimodal content features and ID-based features, so directly using multimodal information as auxiliary input leads to misaligned representations of users and items. In this paper, we first systematically investigate the misalignment issue in multimodal recommendations and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments: alignment within contents, alignment between content and categorical ID, and alignment between users and items. Each alignment is characterized by a specific objective function and is integrated into our multimodal recommendation framework. To train AlignRec effectively, we propose first pre-training the first alignment to obtain unified multimodal features, and then training the other two alignments together with these features as input. As it is essential to analyze whether each multimodal feature helps in training and to accelerate the iteration cycle of recommendation models, we design three new classes of metrics to evaluate intermediate performance. Our extensive experiments on three real-world datasets consistently verify the superiority of AlignRec over nine baselines. We also find that the multimodal features generated by AlignRec are better than those currently in use; they are to be open-sourced in our repository https://github.com/sjtulyf123/AlignRec_CIKM24.
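As a rough illustration of the second training stage summarized above, the sketch below computes the standard BPR pairwise loss and combines it with two alignment terms through a weighted sum. The BPR formula is the usual one, -log sigmoid(positive score - negative score). The function names, the additive combination, and the default weights `alpha` and `beta` are assumptions made for illustration; this page does not give the exact objective, only that two alignment weights are studied (Figure 3).

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean BPR loss over (positive, negative) score pairs.

    Each pair contributes -log(sigmoid(pos - neg)), computed here as
    softplus(neg - pos) in a numerically stable form.
    """
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need two non-empty score lists of equal length")
    total = 0.0
    for pos, neg in zip(pos_scores, neg_scores):
        diff = pos - neg
        # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)


def joint_loss(bpr, cca, uia, alpha=0.1, beta=0.1):
    """Hypothetical joint objective: BPR plus weighted alignment terms.

    `cca` and `uia` stand in for already-computed content-category and
    user-item alignment losses; the weighted-sum form and the default
    weights are illustrative assumptions, not the paper's settings.
    """
    return bpr + alpha * cca + beta * uia


if __name__ == "__main__":
    ranking = bpr_loss([2.0, 0.5], [1.0, 1.5])
    print(joint_loss(ranking, cca=0.8, uia=0.3))
```

When a positive and a negative item receive the same score, the BPR term equals log 2, and it shrinks toward zero as the positive item is ranked further above the negative one.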

Paper Structure (35 sections, 15 equations, 6 figures, 10 tables)

Introduction
Related Work
Problem Formulation
AlignRec
Framework Overview
Architecture Design
Multimodal Encoder Module
Aggregation Module
Fusion Module
Three Alignment Objectives
Inter-Content Alignment
Content-Category Alignment
User-Item Alignment
Training and Evaluating AlignRec
Training Strategies
...and 20 more sections

Figures (6)

Figure 1: Comparison among VBPR [he2016vbpr], FREEDOM [zhou2023tale], BM3 [zhou2023bootstrap] and AlignRec. $\mathcal{L}_{ICA}$, $\mathcal{L}_{CCA}$, and $\mathcal{L}_{UIA}$ are the losses for inter-content alignment, content-category alignment, and user-item alignment. Dashed lines mark the scopes of the alignment losses.
Figure 2: An overview of AlignRec, where user ID, item ID, text and image are input, and user and item representations are output. A and B show the differences between AlignRec and current methods when training, where an intermediate evaluation module and a two-stage training strategy are proposed in AlignRec. C shows the overall architecture of AlignRec.
Figure 3: Hyper-parameter study on two alignment weights.
Figure 4: The t-SNE results of content and ID modality feature pairs with and without alignment.
Figure 5: Results of long-tail items recommendation.
...and 1 more figure
