NutritionVerse: Empirical Study of Various Dietary Intake Estimation Approaches

Chi-en Amy Tai, Matthew Keller, Saeejith Nair, Yuhao Chen, Yifan Wu, Olivia Markham, Krish Parmar, Pengcheng Xi, Heather Keller, Sharon Kirkpatrick, Alexander Wong

TL;DR

NutritionVerse addresses biased dietary assessment by introducing NV-Synth and NV-Real, two public datasets for multimodal dietary sensing. The paper benchmarks direct nutrient prediction against indirect segmentation-based approaches and evaluates the role of depth information and synthetic-real data fusion. Key findings show that direct prediction with real-data pretraining yields the strongest real-world performance, while synthetic data provides benefits mainly when fine-tuned with real data. The work establishes a resource and empirical framework that enables systematic comparison of dietary intake estimation methods and spurs further progress in multimodal dietary assessment.

Abstract

Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However, self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities, and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on https://www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.

Paper Structure

This paper contains 18 sections, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Sample scene from the NV-Synth dataset with the associated multimodal image data (e.g., RGB and depth data) and annotation metadata (e.g., instance and semantic segmentation masks), derived using objects from the NutritionVerse-3D dataset. There are 2 meatloaves, 1 chicken leg, 1 chicken wing, 1 pork rib, and 2 sushi rolls in this scene.
  • Figure 2: An example food scene from NV-Synth with two different camera angles.
  • Figure 3: An example food scene from NV-Real with two different camera angles.
  • Figure 4: Amodal instance segmentation compared with instance segmentation. The blue mask annotates a chicken wing that is partially occluded by a chicken leg.
  • Figure 5: Example segmentation mask for a food dish with a half bread loaf (left) and lasagna (right) for nutrition calculation demonstration. The half bread loaf has a mask with 273,529 pixels, and the lasagna has a mask with 512,985 pixels.
  • ...and 3 more figures
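
The Figure 5 caption quotes per-item mask sizes in pixels. As a rough illustration of how such numbers can be read off a segmentation mask, the sketch below counts pixels per class in a toy mask and computes each item's share of the labeled area. The function names, the class IDs, and the toy mask are hypothetical; only the two pixel counts come from the caption. This is not the paper's nutrition calculation, which would additionally need per-item nutrient information that is not given here.

```python
from collections import Counter


def pixel_counts(mask):
    """Count pixels per class ID in a 2D semantic mask (a list of rows).

    Class ID 0 is assumed to be background and is excluded.
    """
    counts = Counter(label for row in mask for label in row)
    counts.pop(0, None)  # drop background if present
    return dict(counts)


def area_shares(counts):
    """Return each item's fraction of the total labeled pixel area."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy mask with hypothetical class IDs: 1 = bread, 2 = lasagna, 0 = background.
toy_mask = [
    [0, 1, 1, 0],
    [2, 2, 2, 0],
    [2, 2, 0, 0],
]
toy_counts = pixel_counts(toy_mask)

# Pixel counts quoted in the Figure 5 caption.
figure5_counts = {"half bread loaf": 273_529, "lasagna": 512_985}
figure5_shares = area_shares(figure5_counts)

for item, share in figure5_shares.items():
    print(f"{item}: {share:.1%} of labeled pixels")
```

Relative pixel area alone does not determine nutrient content; any conversion from pixels to mass or calories would rest on further assumptions about portion geometry and food density.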