Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

Fangzhou Song, Bin Zhu, Yanbin Hao, Shuo Wang

TL;DR

This work tackles cross-modal recipe retrieval by addressing data misalignment between recipes and food images. It introduces Data Augmented Retrieval (DAR), which uses foundation-model–generated visual imagination descriptions (via Llama2) and ingredient-focused image segments (via SAM) to augment training data, while keeping CLIP frozen and adding lightweight adapters. A multi-level circle loss orchestrates alignment among original and augmented embeddings (image, recipe, segments, and description), achieving state-of-the-art results on Recipe1M with substantially fewer trainable parameters. The framework also demonstrates test-time augmentation benefits, offering improved retrieval performance and interpretability, though it acknowledges limitations in segmentation quality and calls for future improvements in SAM outputs.

Abstract

Learning recipe and food image representations in a common embedding space is non-trivial but crucial for cross-modal recipe retrieval. In this paper, we propose a new perspective on this problem by utilizing foundation models for data augmentation. Leveraging the capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart modality. Specifically, Llama2 is employed to generate a textual description from the recipe, aiming to capture the visual cues of a food image, and SAM is used to produce image segments that correspond to key ingredients in the recipe. To make full use of the augmented data, we introduce the Data Augmented Retrieval (DAR) framework to enhance recipe and image representation learning for cross-modal retrieval. We first inject adapter layers into the pre-trained CLIP model to reduce computation cost rather than fully fine-tuning all the parameters. In addition, a multi-level circle loss is proposed to align the original and augmented data pairs, assigning different penalties to positive and negative pairs. On the Recipe1M dataset, DAR outperforms all existing methods by a large margin. Extensive ablation studies validate the effectiveness of each component of DAR.
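
The abstract does not spell out the multi-level circle loss, so the following is only a rough sketch of the idea, not the authors' formulation. It implements the standard circle loss for a single anchor (weighting each positive and negative similarity by how far it is from its optimum) and then assumes, purely for illustration, that the "multi-level" variant is a weighted sum of that loss over several pairings such as image/recipe, segments/recipe, and image/description. The function names, the `levels` dictionary layout, the uniform weights, and the `margin` and `gamma` defaults are all hypothetical choices.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def _softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def circle_loss(pos_sims, neg_sims, margin=0.25, gamma=64.0):
    """Standard circle loss for one anchor, given cosine similarities.

    margin and gamma are illustrative defaults, not values from the paper.
    """
    o_p, o_n = 1.0 + margin, -margin   # optimum for positives / negatives
    d_p, d_n = 1.0 - margin, margin    # decision margins
    # Pairs far from their optimum receive a larger weight (penalty).
    pos_logits = [-gamma * max(o_p - s, 0.0) * (s - d_p) for s in pos_sims]
    neg_logits = [gamma * max(s - o_n, 0.0) * (s - d_n) for s in neg_sims]
    # log(1 + sum_j exp(neg_j) * sum_i exp(pos_i)), computed stably.
    return _softplus(_logsumexp(neg_logits) + _logsumexp(pos_logits))


def multi_level_circle_loss(levels, weights=None):
    """Assumed form: weighted sum of circle losses over named pairings.

    levels maps a pairing name to (positive_sims, negative_sims).
    """
    if weights is None:
        weights = {name: 1.0 for name in levels}  # hypothetical uniform weights
    return sum(weights[name] * circle_loss(pos, neg)
               for name, (pos, neg) in levels.items())


# Toy similarities for three hypothetical alignment levels.
example_levels = {
    "image-recipe": ([0.9], [0.1, 0.2]),
    "segments-recipe": ([0.7], [0.3]),
    "image-description": ([0.8], [0.25]),
}
total = multi_level_circle_loss(example_levels)
```

In this sketch, a pairing whose positives already score high and whose negatives score low contributes almost nothing, while a poorly separated pairing dominates the total, which is the "different penalties" behaviour the abstract refers to.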

Paper Structure (22 sections, 10 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Illustration of our proposed data augmentation paradigm using foundation models. LLM generates textual descriptions from the recipe to capture the dish's visual cues, while SAM produces image segments aligned with recipe ingredients.
  • Figure 2: (a) Overview of the DAR framework architecture. (b) Architecture of the adapter in the Transformer layer of CLIP. (c) Prompt given to the LLM to generate the visual imagination description.
  • Figure 3: Qualitative examples of recipe-to-image retrieval. The left column shows the query recipe and the corresponding visual imagination description produced by the LLM. The right column shows the Top-5 food images retrieved by DAR++, where blue boxes mark the ground truth. Red bounding boxes mark the Top-1 result retrieved by DAR.
  • Figure 4: Qualitative examples of image-to-recipe retrieval. The first two columns show the image query and its segments from SAM. The Top-5 recipes retrieved by DAR++ are represented as word clouds. Blue boxes mark the ground truth and red boxes mark the Top-1 recipe retrieved by DAR.
  • Figure 5: Examples of visual imagination descriptions and corresponding images. Note that the visual imagination descriptions are generated based on recipes.
  • ...and 3 more figures