LLaVA-Chef: A Multi-modal Generative Model for Food Recipes

Fnu Mohbat, Mohammed J. Zaki

TL;DR

This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach that demonstrates impressive improvements over pretrained LLMs and prior works.

Abstract

In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model's recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretrained LLMs and prior works. A detailed qualitative analysis reveals that LLaVA-Chef generates more detailed recipes with precise ingredient mentions, compared to existing approaches.
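The four training steps listed in the abstract can be laid out as a small stage schedule. This is a minimal sketch, not the authors' code: the component names (`clip_encoder`, `mapping_layer`, `llm`) are hypothetical, the pairing of the abstract's steps with the labels S-0 to S-3 is inferred from the Figure 1 caption, and the choice to keep the mapping layer frozen after S-0 is an assumption, since the source only says the backbone LLM is fine-tuned in those stages.

```python
from dataclasses import dataclass

# Hypothetical component names; the source only names CLIP, a mapping layer, and the LLM.
COMPONENTS = ("clip_encoder", "mapping_layer", "llm")


@dataclass(frozen=True)
class Stage:
    name: str             # stage label as used in the Figure 1 caption
    goal: str             # paraphrase of the corresponding step in the abstract
    trainable: frozenset  # components updated in this stage


# Assumption: in S-1..S-3 only the LLM is updated; the source does not say
# whether the mapping layer stays trainable after S-0.
STAGES = [
    Stage("S-0", "refine the image-to-language mapping", frozenset({"mapping_layer"})),
    Stage("S-1", "adapt to the food domain on recipe data", frozenset({"llm"})),
    Stage("S-2", "train with diverse prompts", frozenset({"llm"})),
    Stage("S-3", "penalize with the custom loss", frozenset({"llm"})),
]


def frozen_components(stage: Stage) -> list:
    """Return the components whose weights stay fixed during the given stage."""
    return [c for c in COMPONENTS if c not in stage.trainable]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, "| train:", sorted(stage.trainable),
              "| frozen:", frozen_components(stage))
```

In a real training loop, a table like this would typically drive which parameters have gradients enabled at each stage; CLIP stays frozen throughout, as the Figure 1 caption states.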

Paper Structure (26 sections, 3 equations, 5 figures, 6 tables)

This paper contains 26 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Architecture of LLaVA-Chef and its training stages (shown in grey). The model inputs $X_t$, $X_{ing}$, and $X_i$ are the recipe title, ingredients, and image, respectively. $Y_{inst}$ denotes the generated recipe instructions, which are compared with the ground-truth instructions $X_{inst}$ to compute the loss. In training Stage-0 (S-0), the image-to-text mapping layer is fine-tuned, whereas in the remaining stages S-1, S-2, and S-3 the backbone LLM is fine-tuned. Given a recipe, we sample a prompt, then substitute <name> and <ingredients> with $X_t$ and $X_{ing}$. Visual features of the image $X_i$ from CLIP are mapped into the language space and concatenated with the language embeddings before passing through the backbone LLM. The frozen and trainable symbols indicate which layers are fine-tuned (e.g., CLIP is frozen, whereas the mapping layer and LLM are trainable).
  • Figure 2: Sample recipes generated by the LLaVA-Chef model, Chef-Transformer [farahani2023chef] (an open-source recipe generation model), and LLaVA [li2023llava] (the best pretrained model). The previous models show hallucination, repetitive text, and inaccuracies.
  • Figure 3: Sample recipe from the Recipe1M dataset. Title is denoted $X_t$, image $X_i$, ingredients $X_{ing}$, and instructions $X_{inst}$.
  • Figure 4: Sample recipes produced by the LLaVA-Chef-S3 model.
  • Figure 5: Example recipes generated by pretrained LLaVA and by each stage of our model. Each stage successively improves the generated recipe, showcasing the effectiveness of our multi-stage training.
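
The prompt construction described in the Figure 1 caption (sample a prompt, then substitute <name> and <ingredients> with the title $X_t$ and ingredients $X_{ing}$) can be sketched as follows. This is only an illustration under stated assumptions: the template strings, the function name `build_prompt`, and the comma-joined ingredient format are hypothetical and do not come from the paper's curated prompt set.

```python
import random

# Hypothetical templates; the paper's actual prompts are not reproduced here.
PROMPT_TEMPLATES = [
    "Write cooking instructions for <name> using these ingredients: <ingredients>.",
    "How do I make <name>? I have <ingredients>.",
    "Describe how to prepare the dish in the image, titled <name>.",
]


def build_prompt(title, ingredients, rng=random):
    """Sample one template and fill in the recipe title and ingredient list."""
    template = rng.choice(PROMPT_TEMPLATES)
    # Assumption: ingredients are joined into one comma-separated string.
    return (template
            .replace("<name>", title)
            .replace("<ingredients>", ", ".join(ingredients)))


if __name__ == "__main__":
    # A seeded generator makes the sampled template reproducible.
    rng = random.Random(0)
    print(build_prompt("Banana Pancakes", ["flour", "eggs", "banana"], rng))
```

Passing a seeded `random.Random` instance keeps the sampling reproducible, which is convenient when comparing runs; the default uses the module-level generator.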