Food Portion Estimation: From Pixels to Calories

Gautham Vinod, Fengqing Zhu

TL;DR

This survey addresses the challenge of estimating real-world food portions from 2D images, focusing on the scale ambiguity that impedes translating image data into volume and calories. It covers geometric approaches (specialized depth sensors, multi-view stereo, and model-based or template methods) and contrasts them with monocular deep learning strategies that infer depth or energy directly from RGB input, including monocular depth prediction, direct energy regression, and implicit representations such as NeRFs. Three major bottlenecks are identified: scale and reference, occlusion, and the density gap between volume and caloric content. To address them, the authors discuss marker-free scale estimation, amodal completion with diffusion models, and density-aware multimodal fusion with large language models and dietary databases. The work emphasizes a paradigm shift toward passive, AI-driven monocular inference, with practical significance for automating dietary tracking and chronic disease management through improved usability and integration with semantic priors. Volume $V$ and density $\rho$ are central to moving from volume to energy, which highlights the need for robust density estimation alongside accurate volume reconstruction. The stereo depth relation $Z = \frac{f \cdot B}{d}$ and the depth map $D(u,v)$ exemplify the geometric foundations that modern learning-based methods seek to replace or augment with learned priors and multimodal cues.
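As a rough illustration of the two relations named in the TL;DR, the sketch below computes depth from stereo disparity with $Z = \frac{f \cdot B}{d}$ and converts a volume $V$ and density $\rho$ into an energy estimate. The function names, the units, and the per-gram energy factor `kcal_per_g` are assumptions introduced for this example and are not taken from the paper. The numeric inputs are placeholders, not measured values.

```python
import math


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Stereo depth Z = f * B / d.

    Assumed units: disparity and focal length in pixels, baseline in meters,
    so the returned depth is in meters. Non-positive disparity carries no
    depth information and is mapped to NaN.
    """
    if disparity_px <= 0:
        return math.nan
    return focal_px * baseline_m / disparity_px


def energy_from_volume(volume_cm3, density_g_per_cm3, kcal_per_g):
    """Energy = V * rho * (kcal per gram).

    kcal_per_g is a hypothetical lookup value (for example, from a dietary
    database); it is an assumption of this sketch.
    """
    mass_g = volume_cm3 * density_g_per_cm3
    return mass_g * kcal_per_g


if __name__ == "__main__":
    # Placeholder inputs chosen only to exercise the formulas.
    z = depth_from_disparity(disparity_px=50.0, focal_px=1000.0, baseline_m=0.5)
    e = energy_from_volume(volume_cm3=200.0, density_g_per_cm3=0.5, kcal_per_g=2.0)
    print(f"depth: {z} m, energy: {e} kcal")
```

The second function makes the density gap concrete: any error in $\rho$ scales the energy estimate by the same factor, however accurate the volume reconstruction is.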

Abstract

Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment struggles to estimate the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by using either monocular images or combinations of the image and the auxiliary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.

Paper Structure (14 sections, 1 equation, 1 figure)

This paper contains 14 sections, 1 equation, 1 figure.

Figures (1)

  • Figure 1: Traditional and Deep Learning Methods. Traditional methods of extrapolating missing 3D information from 2D images rely on depth-based (specialized hardware), multi-view stereo, or model-based methods. Deep learning tries to bridge the gap either by using auxiliary inputs directly or by learning to map the input image to an auxiliary input, then using combinations of these for portion regression.