Food Portion Estimation: From Pixels to Calories
Gautham Vinod, Fengqing Zhu
TL;DR
This survey addresses the challenge of estimating real-world food portions from 2D images, focusing on the scale ambiguity that impedes translating image data into volume and calories. It surveys geometric approaches (Specialized Depth Sensors, Multi-View Stereo, and Model-based/template methods) and contrasts them with monocular deep learning strategies that infer depth or energy directly from RGB input, including Monocular Depth Prediction, Direct Energy Regression, and Implicit Representations like NeRFs. Three major bottlenecks are identified: scale/reference, occlusion, and the density gap between volume and caloric content. To address them, the authors discuss marker-free scale estimation, amodal completion with diffusion models, and density-aware multimodal fusion with large language models and dietary databases. The work emphasizes a paradigm shift toward passive, AI-driven monocular inference, with practical significance for automating dietary tracking and chronic disease management through improved usability and integration with semantic priors. Volume $V$ and density $\rho$ are central to moving from volume to energy, highlighting the need for robust density estimation alongside accurate volume reconstruction. The stereo relation $Z = \frac{f \cdot B}{d}$ (depth from focal length, baseline, and disparity) and the per-pixel depth map $D(u,v)$ exemplify the geometric foundations that modern learning-based methods seek to replace or augment with learned priors and multimodal cues.
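As a rough illustration of the two relations mentioned above, the following minimal Python sketch computes depth from stereo disparity and converts a volume to energy through density. The function names, units, the `kcal_per_g` factor, and all numeric values are assumptions made for illustration only; they are not taken from the survey.

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for a rectified stereo pair (result in metres)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def energy_kcal(volume_cm3: float, density_g_per_cm3: float, kcal_per_g: float) -> float:
    """Energy = V * rho * (energy per gram).

    The kcal-per-gram factor is an assumed extra input: it would normally be
    looked up in a dietary database once the food type is known.
    """
    if min(volume_cm3, density_g_per_cm3, kcal_per_g) < 0:
        raise ValueError("inputs must be non-negative")
    mass_g = volume_cm3 * density_g_per_cm3
    return mass_g * kcal_per_g


# Illustrative (made-up) numbers, not measurements of any real food.
depth_m = stereo_depth(focal_px=800.0, baseline_m=0.5, disparity_px=40.0)
kcal = energy_kcal(volume_cm3=200.0, density_g_per_cm3=0.5, kcal_per_g=2.0)
print(depth_m, kcal)
```

The sketch makes the density gap concrete: an error in $\rho$ scales the energy estimate by the same factor as an equal relative error in $V$, so an accurate volume alone does not give an accurate calorie count.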
Abstract
Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment is limited by the difficulty of estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, including auxiliary inputs such as depth maps and multi-view images, and model-based approaches such as template matching. Deep learning also helps bridge the gap, using either monocular images alone or combinations of the image and auxiliary inputs to predict the portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
