DREAM: A Dual Representation Learning Model for Multimodal Recommendation

Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, Yong Yu

TL;DR

DREAM tackles multimodal recommendation by learning dual representations for behavior and modality. It introduces a Modal-specific Encoder with filter gates and relation graphs, a Similarity Supervised Signal (S3) to mitigate Modal Information Forgetting, and a Behavior-Modal Alignment (BMA) module that jointly optimizes intra- and inter-domain alignment, followed by simple fusion. Empirical results on three datasets show DREAM achieving state-of-the-art performance, with ablations confirming the importance of modal encoding, S3, and BMA. The approach makes more robust use of multimodal signals, and its alignment techniques show potential for transfer to other models.

Abstract

Multimodal recommendation focuses primarily on effectively exploiting both behavioral and multimodal information for the recommendation task. However, most existing models suffer from the following issues when fusing information from two different domains: (1) Previous works underuse modal information, relying only on direct concatenation, addition, or simple linear layers for modal information extraction. (2) Previous works treat modal features as learnable embeddings, which causes the modal embeddings to gradually deviate from the original modal features during learning. We refer to this issue as Modal Information Forgetting. (3) Previous approaches fail to account for the significant differences in distribution between behavior and modality, leading to representation misalignment. To address these challenges, this paper proposes a novel Dual REpresentAtion learning model for Multimodal Recommendation called DREAM. For sufficient information extraction, we introduce two separate lines, the Behavior Line and the Modal Line, in which the Modal-specific Encoder is applied to strengthen modal representations. To address the issue of Modal Information Forgetting, we introduce the Similarity Supervised Signal to constrain the modal representations. Additionally, we design a Behavior-Modal Alignment module to fuse the dual representations through Intra-Alignment and Inter-Alignment. Extensive experiments on three public datasets demonstrate that the proposed DREAM method achieves state-of-the-art (SOTA) results. The source code will be available upon acceptance.

Paper Structure (38 sections, 14 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: (a) When Modal Embeddings are frozen, the model performance drops; (b) The Cosine Distance between learnable modal embeddings and original modal features. Compared to previous works (e.g. VBPR, BM3, FREEDOM, MGCN), the modality embeddings in DREAM converge more quickly and effectively maintain the original modality information through Similarity Supervised Signal.
  • Figure 2: (a) The cosine similarity between dual domains of previous works. (b) The performance improvement of previous works through introducing the BMA, with model structure and hyperparameters unchanged.
  • Figure 3: The overview of DREAM: (a) Behavior Line utilizes ID embedding for behavior representation learning. (b) Modal Line focuses on utilizing the multimodal features through Modal-specific filter gates (Fig. 4) and relation graphs. (c) Similarity Supervised Signal (Fig. 5) constrains the learning of modal representations to mitigate the problem of Modal Information Forgetting. (d) Behavior-Modal Alignment module consists of Intra-Alignment and Inter-Alignment for representation alignment and information fusion.
  • Figure 4: The structure of Image-specific Filter Gate in Vision modality. The Text-specific Filter Gate is similar.
  • Figure 5: The Similarity Supervised Signal (S3) ensures that the batch-scale similarity matrix derived from the modal representations is as close as possible, under Mean Squared Error (MSE), to the similarity matrix computed from the original modal features.
  • ...and 6 more figures
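
The Figure 5 caption describes S3 as an MSE between two batch-scale similarity matrices: one built from the learned modal representations, one from the original modal features. The following is a minimal sketch of that idea, not the authors' implementation. It assumes cosine similarity as the similarity measure (the caption only says "similarity matrix"), and the function names `cosine_similarity_matrix` and `s3_loss` as well as the random toy data are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(x, eps=1e-12):
    """Pairwise cosine similarity between the rows of x (shape: batch x dim)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_unit = x / np.maximum(norms, eps)  # guard against zero-norm rows
    return x_unit @ x_unit.T  # shape: batch x batch


def s3_loss(modal_repr, raw_features):
    """MSE between the two batch-scale similarity matrices.

    Assumption: cosine similarity is used for both matrices. The two inputs
    may have different feature dimensions, because only the batch x batch
    similarity matrices are compared.
    """
    sim_repr = cosine_similarity_matrix(modal_repr)
    sim_raw = cosine_similarity_matrix(raw_features)
    return float(np.mean((sim_repr - sim_raw) ** 2))


# Toy data (hypothetical): 8 items with 32-dim original modal features.
rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 32))

# Stand-in for a learned representation: a random linear projection to 16 dims.
projection = rng.normal(size=(32, 16))
learned = raw @ projection

loss_projected = s3_loss(learned, raw)  # positive: similarity structure changed
loss_identity = s3_loss(raw, raw)       # exactly 0.0: structure fully preserved
print(loss_projected, loss_identity)
```

Because the loss compares pairwise structure rather than the vectors themselves, the learned representation is free to live in a different space from the original features as long as the relative similarities among items in the batch are preserved.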