Fine-tuning Language Models for Recipe Generation: A Comparative Analysis and Benchmark Study

Anneketh Vij, Changhao Liu, Rahul Anil Nair, Theodore Eugene Ho, Edward Shi, Ayan Bhowmick

TL;DR

This paper tackles safe recipe generation by systematically fine-tuning very small to large language models and by developing a multi-dimensional evaluation framework that includes domain-specific metrics and allergen substitution strategies. It compares architectures from GPT-2 and T5-small to SmolLM and Phi-2, and introduces two allergen-substitution approaches: prompt-based and RAG-assisted, using the Food.com dataset. Key findings show larger models often improve standard metrics but exhibit nuanced behavior on domain-specific measures, with Phi-2 showing degradation after fine-tuning and allergen substitution revealing trade-offs between safety and recipe quality. The work advances practical, domain-aware NLG for culinary applications, highlighting the need for robust evaluation and safe substitution systems in real-world recipe-generation tasks.

Abstract

This research explores recipe generation by fine-tuning a range of very small language models, with a focus on developing robust evaluation metrics and comparing models on this open-ended task. This study presents extensive experiments with multiple model architectures, ranging from T5-small (Raffel et al., 2023) and SmolLM-135M (Allal et al., 2024) to Phi-2 (Research, 2023), using both traditional NLP metrics and custom domain-specific evaluation metrics. Our novel evaluation framework incorporates recipe-specific metrics for assessing content quality and introduces approaches to allergen substitution. The results indicate that, while larger models generally perform better on standard metrics, the relationship between model size and recipe quality is more nuanced under domain-specific metrics. SmolLM-360M and SmolLM-1.7B perform comparably both before and after fine-tuning despite their size difference, while fine-tuned Phi-2 shows notable limitations in recipe generation despite its larger parameter count. The evaluation framework and allergen substitution systems provide insights for future work in recipe generation and in broader NLG tasks that require domain expertise and safety considerations.
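
To make the prompt-based allergen substitution idea concrete, the following is a minimal sketch and not the authors' implementation. The substitution table, the function names `find_allergens` and `build_substitution_prompt`, and the prompt wording are all hypothetical and chosen only for illustration. The paper's actual prompts and mappings are not given in this summary.

```python
import re

# Hypothetical substitution table (assumption: not taken from the paper).
SUBSTITUTIONS = {
    "milk": "oat milk",
    "butter": "olive oil",
    "egg": "flaxseed meal",
    "peanut": "sunflower seed",
}


def find_allergens(ingredients, allergens):
    """Return the sorted allergens that occur as whole words in any ingredient."""
    found = set()
    for item in ingredients:
        for allergen in allergens:
            # Match singular or simple plural forms, e.g. "egg" and "eggs".
            if re.search(rf"\b{re.escape(allergen)}s?\b", item.lower()):
                found.add(allergen)
    return sorted(found)


def build_substitution_prompt(ingredients, allergens):
    """Build a prompt asking a model to rewrite the recipe without the allergens."""
    flagged = find_allergens(ingredients, allergens)
    lines = ["Rewrite this recipe so it contains none of: " + ", ".join(allergens) + "."]
    for allergen in flagged:
        swap = SUBSTITUTIONS.get(allergen, "a safe alternative")
        lines.append(f"Suggested swap: {allergen} -> {swap}")
    lines.append("Ingredients: " + "; ".join(ingredients))
    return "\n".join(lines)


if __name__ == "__main__":
    recipe = ["2 cups milk", "3 eggs", "1 cup flour"]
    print(build_substitution_prompt(recipe, ["milk", "egg", "peanut"]))
```

Whole-word matching of this kind is deliberately naive: it would flag "butter" inside "peanut butter", for example. That is one reason a safety check on the generated output, and not only on the prompt, would still be needed.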

Paper Structure

This paper contains 40 sections, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Occurrence Rate of the 30 Most Frequent Ingredients
  • Figure 2: Ingredient Count Distribution (Smoothed KDE)
  • Figure 3: Distribution of Tokenized Length (Ingredient+Steps)
  • Figure 4: Experimental RAG-based Allergen Substitution System Workflow
  • Figure 5: Comparison between Baseline and Fine-Tuned SmolLM-360M
  • ...and 5 more figures