Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone

Rizal Khoirul Anam

TL;DR

This work addresses food recognition and nutrition description from images using a decoupled vision-language pipeline that pairs a visual backbone with a large language model. It introduces the Custom Chinese Food Dataset (CCFD) to mitigate cultural bias and defines Semantic Error Propagation (SEP) to quantify how visual errors propagate into generated outputs, with $SEP = \frac{1}{|\mathcal{D}_{err}|} \sum_{I \in \mathcal{D}_{err}} d_{sem}(f_L(p(c_{pred})), f_L(p(c_{true})))$ where $d_{sem} = 1 - \cos(\cdot,\cdot)$. Empirical results show EfficientNet-B4 achieves the best accuracy-efficiency trade-off (Top-1 89.0%) and Gemini 1.5 Pro yields high factual accuracy, yet end-to-end utility is limited by perceptual accuracy due to error amplification by the LLM. The findings underscore the importance of culturally aware datasets and robust perception to make such systems practically useful for automated nutrition guidance and recipe generation.
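The SEP definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It reads $f_L$ as the language model and $p$ as the prompt built from a class label, which this summary does not spell out. It also assumes each generated output has already been mapped to a fixed-length embedding vector by some text encoder. That embedding step, the function names, and the toy vectors are all hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """d_sem = 1 - cos(u, v), as in the SEP definition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def semantic_error_propagation(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Mean semantic distance over the misclassified set D_err.

    Each pair holds two embeddings (an assumption of this sketch):
    the output generated from the predicted class, and the output
    generated from the true class, for one misclassified image.
    """
    if not pairs:
        raise ValueError("D_err is empty: SEP is undefined without errors")
    total = sum(cosine_distance(pred, true) for pred, true in pairs)
    return total / len(pairs)


# Toy embeddings for two misclassified images (made-up values).
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal outputs: distance 1.0
    ([1.0, 0.0], [1.0, 0.0]),  # identical outputs: distance 0.0
]
sep = semantic_error_propagation(toy_pairs)
print(sep)  # 0.5
```

A low SEP means a misclassification led to a semantically similar output anyway, for example when two near-identical dishes are confused. A value near 1 means the visual error produced an unrelated recipe or nutrition description.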

Abstract

The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google's Gemini LLM). The core objective is to assess the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for "Semantic Error Propagation" (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 accuracy) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 factual accuracy) provides superior generative quality, the system's overall utility is fundamentally bottlenecked by the perceptual accuracy of the visual front-end. We conduct a detailed per-class analysis, identifying high semantic similarity between dishes as the most critical failure mode.

Paper Structure

This paper contains 42 sections, 14 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Efficiency Comparison: Top-1 Accuracy vs. Model Size (Million Parameters). EfficientNet-B4 (top left) shows the best balance.
  • Figure 2: Graphical comparison of Top-1 vs. Top-5 Accuracy. The small gap for EfficientNet-B4 indicates high model confidence.
  • Figure 3: Visual comparison of worst (left) and best (right) performing classes. Failures are centered on dishes with high visual AND semantic similarity.
  • Figure 4: Normalized confusion matrix (simulation) for EfficientNet-B4. Bright off-diagonal spots (e.g., between 'Spicy Crayfish' and 'Spicy Shrimp') indicate model confusion.
  • Figure 5: Graphical comparison of qualitative scores (Relevance, Factual Accuracy, Coherence) between Gemini Pro and Gemma. Gemini consistently outperforms Gemma.
  • ...and 1 more figure