FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation

Yuki Imajuku, Yoko Yamakata, Kiyoharu Aizawa

TL;DR

This work tackles generating Japanese recipe text from food images by fine-tuning open multimodal LLMs (LLaVA-1.5 and Phi-3 Vision) on a large Rakuten Recipe corpus and diverse non-food data. It introduces a 50-category evaluation scheme and uses GPT-4o as an external baseline, assessing ingredients and procedures without normalization. The fine-tuned open MLLMs achieve higher ingredient-generation accuracy (F1 of 0.531 versus 0.481 for GPT-4o) and competitive procedure-generation quality. The study demonstrates that open models can surpass a leading closed model in key cooking-content tasks and highlights the value of non-food data for robustness. These results advance practical, multilingual food understanding and enable broader, accessible deployment of recipe-generation capabilities across languages and cultures.

Abstract

Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, both in their vast knowledge and in their ability to handle language naturally. Although they are used predominantly with English, they also support multiple languages, including Japanese. This suggests that MLLMs can significantly improve performance in food image understanding tasks. We fine-tuned the open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of the generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved an F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481. Furthermore, our model performed comparably to GPT-4o in generating cooking procedure text.
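
To make the ingredient F1 figures concrete, the following is a minimal sketch of how precision, recall, and F1 can be computed for one recipe's ingredient list. It assumes exact string matching between predicted and reference ingredient names, which is a simplification: the figure captions indicate that the paper prompts GPT-4o to compare generated ingredients with the ground truth, and the abstract does not say how scores are aggregated across the 5,000 samples. The function name `ingredient_f1` and the sample ingredients are hypothetical.

```python
def ingredient_f1(predicted, reference):
    """Set-based F1 between predicted and reference ingredient names.

    Assumption: two ingredients match only if their strings are identical.
    This stands in for whatever matching the actual evaluation uses.
    """
    pred_set = set(predicted)
    ref_set = set(reference)
    matched = len(pred_set & ref_set)

    # Precision: share of predicted ingredients that appear in the reference.
    precision = matched / len(pred_set) if pred_set else 0.0
    # Recall: share of reference ingredients that were predicted.
    recall = matched / len(ref_set) if ref_set else 0.0

    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 predictions are correct, 2 of 4 references found.
predicted = ["卵", "砂糖", "牛乳"]
reference = ["卵", "砂糖", "小麦粉", "バター"]
score = ingredient_f1(predicted, reference)
```

Here precision is 2/3 and recall is 2/4, so the F1 score is 4/7, roughly 0.571. A score averaged over many recipes, such as the reported 0.531, summarizes per-recipe or pooled counts of this kind.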

Paper Structure

This paper contains 19 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overview of our models. Top: an example of recipe text generated from an input food image. Bottom: an example of refusal text generated from an input non-food image. Both are in Japanese and were generated by our model.
  • Figure 2: The 50 categories we created for the test data.
  • Figure 3: Training data description. Upper left: the format of recipes. Lower left: the format of refusal text. Right: description of the six patterns and example user query prompts.
  • Figure 4: The actual prompts used for GPT-4o inference. Left: the prompt for recipe generation. Right: the prompt for comparing ingredients between the generated recipe and the ground truth. Both are in Japanese.
  • Figure 5: Example outputs of our models. Left: a difficult food example. Right: a non-food image example.