Comprehensive Evaluation of Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata

Bruce Coburn, Jiangpeng He, Megan E. Rollo, Satvinder S. Dhaliwal, Deborah A. Kerr, Fengqing Zhu

TL;DR

The paper aims to improve nutrition analysis from meal images by injecting contextual metadata (GPS, timestamps, and food item lists) into prompts for eight Large Multimodal Models, spanning open-weight and closed-weight architectures. It introduces ACETADA, a dietitian-verified, context-rich dataset, and systematically evaluates how metadata and various reasoning modifiers (e.g., Chain-of-Thought, Expert Persona) affect nutrient and portion estimation accuracy, measured by Mean Absolute Error ($MAE$) and Mean Absolute Percentage Error ($MAPE$). Across two experiments, metadata consistently reduces errors, with notable gains for calories and portions, and open-weight models often benefit most from context without fine-tuning. The findings demonstrate a practical, low-latency approach to boosting nutrition analysis performance in real-world settings and point to future directions in richer metadata integration and architectural incorporation for robust, context-aware dietary monitoring.

Abstract

Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. However, existing work primarily evaluates proprietary models, such as GPT-4, leaving the broad range of LMMs underexplored. Additionally, the influence of integrating contextual metadata and its interaction with various reasoning modifiers remains largely uncharted. This work investigates how interpreting contextual metadata derived from GPS coordinates (converted to location/venue type), timestamps (transformed into meal/day type), and the food items present can enhance LMM performance in estimating key nutritional values. These values include calories, macronutrients (protein, carbohydrates, fat), and portion sizes. We also introduce ACETADA, a new food-image dataset slated for public release. This open dataset provides dietitian-verified nutrition information and serves as the foundation for our analysis. Our evaluation across eight LMMs (four open-weight and four closed-weight) first establishes the benefit of contextual metadata integration over straightforward prompting with images alone. We then demonstrate how this incorporation of contextual information enhances the efficacy of reasoning modifiers, such as Chain-of-Thought, Multimodal Chain-of-Thought, Scale Hint, Few-Shot, and Expert Persona. Empirical results show that integrating metadata, even through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values. This work highlights the potential of context-aware LMMs for improved nutrition analysis.
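
For readers less familiar with the two error metrics named in the abstract, the following is a minimal sketch of how MAE and MAPE are conventionally computed over per-meal estimates. The function names and the handling of zero ground-truth values are assumptions made for illustration; the paper's own evaluation code is not shown here and may differ.

```python
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error: average of |truth - prediction|."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, expressed in percent.

    Assumption: a ground-truth value of zero is treated as an error,
    because the percentage error is undefined for it.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a ground-truth value is zero")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Illustrative (made-up) calorie values for two meals, in kcal.
true_kcal = [500.0, 400.0]
pred_kcal = [450.0, 500.0]
print(mae(true_kcal, pred_kcal))   # average absolute error in kcal
print(mape(true_kcal, pred_kcal))  # average relative error in percent
```

MAE is reported in the unit of the quantity (kcal or grams), while MAPE normalizes by the ground truth, so the two can rank models differently when meal sizes vary widely.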

Paper Structure

This paper contains 21 sections, 2 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Contextual metadata overview. Location, time, and food context can be combined with the meal photo and the "base prompt" ("Analyze the food image and estimate ..."). This enriched prompt is passed to an LMM to reduce absolute error and absolute percentage error. In this instance, the caloric absolute error and caloric absolute percentage error improve by 100 and 23.42 points, respectively. Aggregated results appear in the Results section.
  • Figure 2: Overview of nutrient, portion weight, and energy distributions in the ACETADA dataset. Subplots (a-e) are histograms where the y-axis represents frequency (count of meals), showing per-meal distributions for: (a) Energy (kcal), (b) Carbohydrates (g), (c) Fat (g), (d) Protein (g), and (e) Overall Portion weight (g). Subplot (f) is a box plot illustrating the distribution of Energy (kcal) for different meal types (Breakfast, Lunch, Dinner), where the y-axis represents Energy (kcal).
  • Figure 3: Meal composition characteristics in the ACETADA dataset: (a) Histogram showing the distribution of the number of distinct food items recorded per meal. (b) Pie chart illustrating the proportional breakdown of identified cuisine categories.
  • Figure 4: Example of ACETADA images across breakfast, lunch, and dinner with corresponding available contextual metadata. In this instance, images are taken by the same participant.
  • Figure 5: Prompt-construction flowchart. Metadata flags (orange)—gps, timestamp, food—and reasoning modifiers (blue)—cot, mmcot, scale, fewshot, expert—optionally augment a base nutrition-analysis prompt (grey) before being sent, with the meal image, to the LMM.
  • ...and 8 more figures
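
The prompt-construction flow described in the Figure 5 caption can be sketched roughly as follows. This is an illustrative sketch only: the flag and modifier names (gps, timestamp, food; cot, mmcot, scale, fewshot, expert) and the opening of the base prompt come from the captions, but the modifier wording, the hour cutoffs for meal type, the weekday/weekend split, and all function names are hypothetical. The base prompt is left elided exactly as it appears in the source, and the GPS-to-venue conversion is assumed to happen upstream.

```python
from datetime import datetime
from typing import Optional, Sequence

# Base prompt as quoted in the Figure 1 caption; the remainder is elided there.
BASE_PROMPT = "Analyze the food image and estimate ..."

# Hypothetical wording for each reasoning modifier named in Figure 5.
MODIFIER_TEXT = {
    "cot": "Think step by step before giving the final estimates.",
    "mmcot": "First describe what you see in the image, then reason step by step.",
    "scale": "Use visible objects such as plates or cutlery as a size reference.",
    "fewshot": "Follow the format of the worked examples provided.",
    "expert": "You are an experienced dietitian.",
}


def meal_type(ts: datetime) -> str:
    """Map a timestamp to a meal type (hour cutoffs are assumptions)."""
    if ts.hour < 11:
        return "breakfast"
    if ts.hour < 16:
        return "lunch"
    return "dinner"


def day_type(ts: datetime) -> str:
    """Map a timestamp to weekday/weekend (Saturday and Sunday are weekend)."""
    return "weekend" if ts.weekday() >= 5 else "weekday"


def build_prompt(
    venue_type: Optional[str] = None,       # gps flag, already converted to a venue type
    timestamp: Optional[datetime] = None,   # timestamp flag
    food_items: Optional[Sequence[str]] = None,  # food flag
    modifiers: Sequence[str] = (),
) -> str:
    """Augment the base prompt with whichever metadata and modifiers are set."""
    parts = [BASE_PROMPT]
    if venue_type:
        parts.append(f"Location context: {venue_type}.")
    if timestamp is not None:
        parts.append(
            f"Time context: {meal_type(timestamp)} on a {day_type(timestamp)}."
        )
    if food_items:
        parts.append("Food items present: " + ", ".join(food_items) + ".")
    for name in modifiers:
        if name not in MODIFIER_TEXT:
            raise ValueError(f"unknown reasoning modifier: {name}")
        parts.append(MODIFIER_TEXT[name])
    return "\n".join(parts)


prompt = build_prompt(
    venue_type="home",
    timestamp=datetime(2024, 3, 9, 8, 30),
    food_items=["toast", "scrambled eggs"],
    modifiers=["expert", "cot"],
)
print(prompt)
```

The resulting text would be sent to the LMM together with the meal image. Keeping every flag optional mirrors the caption's description of metadata and modifiers as optional augmentations, so the image-only baseline is simply `build_prompt()` with no arguments.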