Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

Guoshan Liu; Bin Zhu; Yian Li; Jingjing Chen; Chong-Wah Ngo; Yu-Gang Jiang

TL;DR

This work proposes a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation, achieving state-of-the-art performance on Recipe1M and markedly improved semantic fidelity.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and an ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
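As a rough illustration of what a frequency-aware reward could look like, the sketch below weights each action by an inverse log-frequency term and scores a prediction with a weighted F1. The abstract does not give the reward formula, so the weighting form, the function names `frequency_weights` and `frequency_aware_reward`, and the `smoothing` and `default_weight` parameters are all assumptions made for this example, not the authors' definition.

```python
import math
from collections import Counter


def frequency_weights(train_actions, smoothing=1.0):
    """Map each action to a weight that grows as the action gets rarer.

    Assumed form: 1 + log(total / count), with additive smoothing.
    """
    counts = Counter(train_actions)
    total = sum(counts.values())
    return {
        action: 1.0 + math.log((total + smoothing) / (count + smoothing))
        for action, count in counts.items()
    }


def frequency_aware_reward(predicted, reference, weights, default_weight=1.0):
    """Weighted F1 between predicted and reference action sets.

    Correctly predicting a rare (long-tail) action contributes more
    than correctly predicting a frequent one.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0

    def weight(action):
        # Actions unseen in training fall back to a hypothetical default.
        return weights.get(action, default_weight)

    hit = sum(weight(a) for a in predicted & reference)
    if hit == 0:
        return 0.0
    precision = hit / sum(weight(a) for a in predicted)
    recall = hit / sum(weight(a) for a in reference)
    return 2 * precision * recall / (precision + recall)


# Toy corpus: "mix" is frequent, "blanch" is rare.
weights = frequency_weights(["mix"] * 90 + ["blanch"] * 10)
reward_common = frequency_aware_reward(["mix"], ["mix", "blanch"], weights)
reward_rare = frequency_aware_reward(["blanch"], ["mix", "blanch"], weights)
```

Under this assumed weighting, recovering only the rare action `blanch` earns a higher reward than recovering only the frequent action `mix`, which is the qualitative behavior a long-tail-oriented reward is meant to encourage.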

Paper Structure (14 sections, 11 equations, 3 figures, 4 tables)

Introduction
Method
Two-Stage Training for Cooking Action Prediction
Two-Stage Training for Ingredient Recognition
Semantic Confidence Scoring and Rectification
Instruction Generation with Action-Ingredient Context Prompting
Experiments and Results
Implementation Details
Datasets and Metrics
Performance Comparison
Ablation Study
Qualitative Results
Conclusion
Acknowledgement

Figures (3)

Figure 1: Limitations of lexical metrics (e.g., SacreBLEU, ROUGE-L) in recipe generation. Comparison of LLaVA-FT output (a) with two semantically corrupted instructions: (b) correct actions but wrong ingredients, (c) correct ingredients but wrong actions. Red/green: incorrect/correct; yellow: actions; blue: ingredients.
Figure 2: Overview of our semantically grounded recipe generation framework. The method prioritizes semantic integrity by modeling and verifying ingredients and cooking actions before instruction generation. A two-stage pipeline (SFT→GRPO) trains the action and ingredient models; the action model uses chain-of-thought reasoning (AR-SFT) and frequency-aware rewards (AR-RFT) to handle long-tail actions. The SCSR module (an MLLM, e.g., GPT-4o) scores and rectifies predictions, and the generator conditions on the rectified labels.
Figure 3: Qualitative Results. 'GT' indicates the ground truth. Green text highlights correctly predicted cooking actions, ingredients, or phrases, while red text indicates incorrect predictions. Our model generates more complete, semantically aligned recipes than LLaVA-FT.
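
The limitation described in Figure 1 can be reproduced with a toy lexical metric. The sketch below uses a simple unigram-overlap F1 as a stand-in for metrics such as ROUGE; it is not SacreBLEU or ROUGE-L, and the three instruction strings are invented for illustration rather than taken from the paper.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Token-level F1 based on clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matches per token
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical instructions, each differing from the reference by one word.
reference = "bake the chicken with garlic for 20 minutes"
wrong_ingredient = "bake the salmon with garlic for 20 minutes"
wrong_action = "fry the chicken with garlic for 20 minutes"

score_ingredient = unigram_f1(wrong_ingredient, reference)  # 0.875
score_action = unigram_f1(wrong_action, reference)          # 0.875
```

Both corrupted instructions keep seven of eight tokens and score 0.875, even though one names the wrong ingredient and the other the wrong cooking action. A lexical metric of this kind cannot tell the two errors apart from a near-perfect output.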
