Retrieval Augmented Recipe Generation

Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

TL;DR

The paper tackles hallucinations in one-stage vision-language models for recipe generation from food images by introducing a retrieval-augmented framework. It combines a cross-modal retriever (Revamping) with a frozen LLaVA backbone enhanced by LoRA, augmented with Stochastic Diversified Retrieval Augmentation (SDRA) and a Self-consistency Ensemble Voting strategy to select consensus outputs. Empirical results on Recipe1M show state-of-the-art performance in both recipe generation and ingredient recognition, validating the effectiveness of external retrieval context and output-consensus mechanisms. The approach offers a practical path to more reliable, context-aware food understanding and recipe synthesis, with potential extensions such as self-reflection for further improvements.

Abstract

Given the potential applications of generating recipes from food images, this area has garnered significant attention from researchers in recent years. Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients. Large Multi-modal Models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this, we propose a retrieval augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to determine the most confident predicted recipe as the final output. It calculates the consistency among generated recipe candidates, each of which uses a different retrieved recipe as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation tasks on the Recipe1M dataset.
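The voting step described in the abstract can be illustrated with a minimal sketch. It assumes that "consistency" is scored as the summed pairwise similarity between one candidate and all the others, and it uses token-level Jaccard overlap as a stand-in similarity; the abstract does not specify the actual metric, so both the function names (`jaccard`, `select_by_consistency`) and the scoring choice are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap (an assumed stand-in for the paper's similarity score)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_consistency(candidates: List[str]) -> int:
    """Return the index of the candidate that agrees most with the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate recipe")
    best_index, best_score = 0, -1.0
    for i, candidate in enumerate(candidates):
        # Agreement score: sum of similarities to every other candidate.
        score = sum(
            jaccard(candidate, other)
            for j, other in enumerate(candidates)
            if j != i
        )
        if score > best_score:  # strict comparison: ties keep the earlier candidate
            best_index, best_score = i, score
    return best_index


# Each candidate would come from one generation pass with a different retrieved recipe as context.
candidates = [
    "mix flour sugar eggs then bake",
    "mix flour sugar eggs butter then bake",
    "mix flour butter then bake",
    "grill salmon with lemon",
]
chosen = candidates[select_by_consistency(candidates)]
```

In this toy example the outlier candidate (the salmon recipe) shares no tokens with the others and so receives no agreement, which is the intuition behind using consensus to filter hallucinated outputs.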

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) The structural differences between our retrieval-augmented framework and the "two-stage" (Inverse Cooking, FIRE) and "LMMs-based" (LLaVA, FoodLMM) approaches. "G" refers to the generator. (b) Recipe generation results comparison. "GT" refers to ground truth, "LLaVA-FT" denotes the model using pre-trained LLaVA weights fine-tuned on Recipe1M, "Inverse cooking" represents a model trained in two stages, "FoodLMM" is the LMMs-based model for recipe generation, and "Ours" refers to our model. Yellow highlights indicate ingredients that match those in the "GT", blue signifies predicted cooking instructions matching the "GT", and red font denotes incorrectly predicted ingredients.
  • Figure 2: Templates for Recipe Generation.
  • Figure 3: Overview of our proposed model architecture. Our model consists of a retriever, which searches for recipes semantically similar to the image to use as references, and a generator based on a frozen LLaVA with a trainable LoRA, which generates a recipe from the image and the retrieved recipes. Stochastic Diversified Retrieval Augmentation uses retrieved ingredients and instructions to form the recipe demonstration $R$, which is fed into the generator for training. Self-consistency Ensemble Voting selects the final recipe output based on mutual agreement among the recipe candidates, which are produced by using each of the top-1 to top-$s$ retrieved recipes as context.
  • Figure 4: Qualitative results. The ingredients in generated recipes that overlap with ground truth ("GT") are highlighted in yellow, while details in the instructions that match the GT are shown in blue. Otherwise, the incorrect generation results are displayed in red. Best viewed in color.
  • Figure 5: Comparison between generated recipes and GT recipes. Yellow highlights indicate ingredients that match those in the GT; ingredients incorrectly identified by the model are shown in red.
  • ...and 2 more figures
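
The retrieval-augmented prompting described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Recipe` type, the prompt wording, and the helper names are hypothetical (the actual templates are those of Figure 2, which are not reproduced here), and the "stochastic" training step is assumed to mean sampling one demonstration at random from the top-k retrieved recipes. The inference helper follows the caption directly: one prompt per retrieved recipe, from top 1 to top s.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Recipe:
    """Hypothetical container for one retrieved recipe."""
    title: str
    ingredients: List[str]
    instructions: List[str]


def build_prompt(demo: Recipe) -> str:
    """Format a retrieved recipe as the demonstration R inside an assumed prompt template."""
    return (
        "Reference recipe:\n"
        f"Title: {demo.title}\n"
        f"Ingredients: {', '.join(demo.ingredients)}\n"
        f"Instructions: {' '.join(demo.instructions)}\n"
        "Generate the recipe for the food shown in the image."
    )


def training_prompt(retrieved: List[Recipe], k: int, rng: random.Random) -> str:
    """Assumed SDRA training step: sample one of the top-k retrieved recipes as context."""
    return build_prompt(rng.choice(retrieved[:k]))


def inference_prompts(retrieved: List[Recipe], s: int) -> List[str]:
    """One prompt per retrieved recipe (top 1 to top s); each yields one recipe candidate."""
    return [build_prompt(recipe) for recipe in retrieved[:s]]


# Toy retrieval results, ordered from most to least similar to the query image.
retrieved = [
    Recipe("Tomato soup", ["tomato", "onion"], ["Simmer.", "Blend."]),
    Recipe("Gazpacho", ["tomato", "cucumber"], ["Blend.", "Chill."]),
    Recipe("Salsa", ["tomato", "chili"], ["Chop.", "Mix."]),
]
prompts = inference_prompts(retrieved, s=2)
sampled = training_prompt(retrieved, k=2, rng=random.Random(0))
```

Each prompt in `prompts` would be paired with the image and passed to the generator, and the resulting candidates would then be compared by the voting step.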