Multi-modal Food Recommendation using Clustering and Self-supervised Learning

Yixin Zhang, Xin Zhou, Qianwen Meng, Fanglin Zhu, Yonghui Xu, Zhiqi Shen, Lizhen Cui

TL;DR

This work tackles the challenge of leveraging rich multi-modal recipe content without diluting collaborative signals in food recommendations. It introduces CLUSSL, a clustering and self-supervised learning framework that builds modality-specific graphs from discrete ingredients and continuous visual/textual features, then learns recipe representations via graph convolution and a cross-modal distance-correlation objective. The key contributions are the prototype-based continuous graphs, a distance-correlation regularizer to enforce independence across modalities, and a LightGCN-based backbone that jointly optimizes for recommendation accuracy. Across Allrecipes and Food.com, CLUSSL achieves consistent state-of-the-art performance, demonstrating the practical value of transforming semantic multimodal information into structured representations and of self-supervised cross-modal regularization for robust, accurate recommendations.
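
To make the distance-correlation idea concrete, here is a minimal sketch of the standard sample distance correlation between two sets of recipe representations. It is an illustration under assumptions, not the paper's implementation: the function names and the toy vectors are hypothetical, and the exact form of the regularizer used in CLUSSL is not given in this summary. A value near 0 suggests the two representations are close to independent, which is the direction a regularizer of this kind would push.

```python
import math


def pairwise_distances(rows):
    """Euclidean distance between every pair of row vectors."""
    return [[math.dist(a, b) for b in rows] for a in rows]


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    n = len(d)
    row_means = [sum(r) / n for r in d]
    grand_mean = sum(row_means) / n
    # d is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_covariance_sq(a, b):
    """Squared sample distance covariance from two centered matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between paired samples x and y.

    x and y are lists of equal length; row i of x and row i of y describe
    the same recipe in two (assumed) modalities.
    """
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    dcov_sq = distance_covariance_sq(a, b)
    denom = math.sqrt(distance_covariance_sq(a, a) * distance_covariance_sq(b, b))
    if denom == 0.0:
        return 0.0  # one side is constant, so no dependence can be measured
    return math.sqrt(max(dcov_sq, 0.0) / denom)


# Hypothetical 2-d representations of four recipes in two modalities.
image_repr = [[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5]]
text_repr = [[0.3, 0.1], [0.9, 0.7], [0.2, 0.8], [0.5, 0.4]]
score = distance_correlation(image_repr, text_repr)
```

In a training loop, a score like this would typically be computed on mini-batches of learned embeddings with a differentiable tensor library and added to the recommendation loss with a weighting coefficient; the pure-Python version above is only meant to show the arithmetic.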

Abstract

Food recommendation systems serve as pivotal components in the realm of digital lifestyle services, designed to assist users in discovering recipes and food items that resonate with their unique dietary predilections. Typically, multi-modal descriptions offer an exhaustive profile for each recipe, thereby ensuring recommendations that are both personalized and accurate. Our preliminary investigation of two datasets indicates that pre-trained multi-modal dense representations might precipitate a deterioration in performance compared to ID features when encapsulating interactive relationships. This observation implies that ID features possess a relative superiority in modeling interactive collaborative signals. Consequently, contemporary cutting-edge methodologies augment ID features with multi-modal information as supplementary features, overlooking the latent semantic relations between recipes. To rectify this, we present CLUSSL, a novel food recommendation framework that employs clustering and self-supervised learning. Specifically, CLUSSL formulates a modality-specific graph tailored to each modality with discrete/continuous features, thereby transforming semantic features into structural representation. Furthermore, CLUSSL procures recipe representations pertinent to different modalities via graph convolutional operations. A self-supervised learning objective is proposed to foster independence between recipe representations derived from different unimodal graphs. Comprehensive experiments on real-world datasets substantiate that CLUSSL consistently surpasses state-of-the-art recommendation benchmarks in performance.
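
As a rough illustration of how continuous features could be turned into graph structure through clustering, the sketch below clusters recipe feature vectors into prototypes and links each recipe to its nearest prototypes. The abstract does not spell out the construction, so everything here is an assumption: plain k-means as the clustering step, centroids as prototypes, a `top_k` linking rule, and the function names `kmeans` and `recipe_prototype_edges` are all hypothetical.

```python
import math


def kmeans(points, num_prototypes, iters=20):
    """Plain Lloyd's algorithm; the first points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:num_prototypes]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda c: math.dist(p, centroids[c])
            )
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def recipe_prototype_edges(points, centroids, top_k):
    """Link each recipe index to the indices of its top_k nearest prototypes."""
    edges = []
    for recipe_idx, p in enumerate(points):
        ranked = sorted(
            range(len(centroids)), key=lambda c: math.dist(p, centroids[c])
        )
        edges.extend((recipe_idx, proto_idx) for proto_idx in ranked[:top_k])
    return edges


# Hypothetical 2-d visual features for four recipes.
features = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
prototypes = kmeans(features, num_prototypes=2)
edges = recipe_prototype_edges(features, prototypes, top_k=1)
```

The resulting edge list describes a bipartite recipe-prototype graph over which graph convolution could then propagate information, so that recipes sharing a prototype become structurally close even when they share no users.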

Paper Structure

This paper contains 25 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: (a) exemplifies a recipe with its associated modalities, (b) showcases the construction of the modality-specific graph via continuous multi-modal features, and (c) illustrates the overall framework of the proposed CLUSSL.
  • Figure 2: Performance trends of CLUSSL under different settings of the coefficient $\lambda$, the number of prototypes, and the top-$k$ nearest setting on the Allrecipes and Food.com datasets.