Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

TL;DR

This work proposes a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation and shows state-of-the-art performance and markedly improved semantic fidelity.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
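
The abstract does not spell out the form of the frequency-aware reward, so the following is only a minimal sketch of one way such a reward could look, not the authors' formulation. It assumes a log inverse-frequency weight per action and a weighted F1 between predicted and reference action sets; the function names, the weighting formula, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(label_lists):
    """Assumed weighting: log(1 + N / count), so rarer actions weigh more."""
    counts = Counter(a for labels in label_lists for a in set(labels))
    n = len(label_lists)
    return {a: math.log(1 + n / c) for a, c in counts.items()}


def frequency_aware_reward(predicted, reference, weights):
    """Hypothetical reward: weighted F1 between predicted and reference action sets."""
    pred, ref = set(predicted), set(reference)

    def mass(actions):
        # Actions missing from the weight table fall back to a neutral weight of 1.0.
        return sum(weights.get(a, 1.0) for a in actions)

    hit = mass(pred & ref)
    if hit == 0:
        return 0.0
    precision = hit / mass(pred)
    recall = hit / mass(ref)
    return 2 * precision * recall / (precision + recall)


# Toy corpus (made up): "mix" is frequent, "blanch" is rare.
corpus = [["mix", "bake"], ["mix", "stir"], ["mix", "blanch"], ["mix", "bake"]]
weights = inverse_frequency_weights(corpus)

reference = ["mix", "blanch"]
reward_common_only = frequency_aware_reward(["mix"], reference, weights)
reward_rare_only = frequency_aware_reward(["blanch"], reference, weights)
```

Under this weighting, recovering only the rare action ("blanch") earns a higher reward than recovering only the frequent one ("mix"), which is the kind of pressure toward long-tail actions that the abstract attributes to RFT.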

Paper Structure

This paper contains 14 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Limitations of lexical metrics (e.g., SacreBLEU, ROUGE-L) in recipe generation. Comparison of LLaVA-FT output (a) with two semantically corrupted instructions: (b) correct actions but wrong ingredients, (c) correct ingredients but wrong actions. Red/green: incorrect/correct; yellow: actions; blue: ingredients.
  • Figure 2: Overview of our Semantically-Grounded Recipe Generation Framework. The method prioritizes semantic integrity by modeling and verifying ingredients and cooking actions before instruction generation. A two-stage pipeline (SFT→GRPO) trains the action and ingredient models; the action model uses CoT reasoning (AR-SFT) and frequency-aware rewards (AR-RFT) to handle long-tail actions. SCSR, built on an MLLM (e.g., GPT-4o), scores and rectifies the predictions; the generator conditions on the rectified labels.
  • Figure 3: Qualitative Results. 'GT' indicates the ground truth. Green text highlights correctly predicted cooking actions, ingredients, or phrases, while red text indicates incorrect predictions. Our model generates more complete, semantically aligned recipes than LLaVA-FT.
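
The limitation described in Figure 1 can be reproduced with a toy example. The sketch below uses a simplified ROUGE-L style F1 (longest common subsequence over whitespace tokens), not the SacreBLEU or ROUGE-L implementations used in the paper, and the three sentences are invented for illustration. Swapping a single ingredient or a single action leaves the lexical score high even though the instruction is now semantically wrong.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences: one reference and two semantically corrupted variants.
reference = "chop the onions and fry them in butter until golden"
wrong_ingredient = "chop the apples and fry them in butter until golden"
wrong_action = "chop the onions and boil them in butter until golden"

score_wrong_ingredient = rouge_l_f1(wrong_ingredient, reference)
score_wrong_action = rouge_l_f1(wrong_action, reference)
print(score_wrong_ingredient, score_wrong_action)
```

Both corrupted variants share nine of ten tokens with the reference in order, so each scores about 0.9 despite the wrong ingredient or action.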