How Much You Ate? Food Portion Estimation on Spoons
Aaryam Sharma, Chris Czarnecki, Yuhao Chen, Pengcheng Xi, Linlin Xu, Alexander Wong
TL;DR
This work presents a utensil-centered approach to dietary portion estimation that uses a stationary front-facing camera to track food on utensils through a three-stage pipeline: segmentation (HQ-SAM with Grounding DINO, or XMem video object segmentation (VOS)), key-frame detection, and volume estimation by fitting geometric shapes (Prism, Ellipsoid, Hemisphere). It introduces a Spurious Segmentation Filtering mechanism and reports that Ellipsoid fitting with XMem-based VOS and filtering achieves strong performance on ten rice-on-spoon videos with ground-truth volumes, which points to the method's potential for non-invasive, accurate monitoring of liquid-solid meals such as soups and stews. The study also discusses the trade-offs between independent-frame segmentation and video object segmentation, and emphasizes practical considerations such as frame sampling, per-frame length mapping, and utensil-geometry priors. Overall, the approach offers a promising pathway toward accessible, camera-based dietary monitoring suited to aging-in-place and home-health contexts.
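To make the volume-estimation stage summarized above more concrete, here is a minimal Python sketch of how a food region measured in pixels might be converted to a volume under each of the three shape assumptions. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of a known spoon width as the pixel-to-millimetre reference, the reading of "Prism" as a rectangular box, and the choice of the mean of length and width as the hemisphere diameter are all hypothetical.

```python
import math


def mm_per_pixel(spoon_width_mm: float, spoon_width_px: float) -> float:
    """Per-frame length scale from a utensil-geometry prior.

    Assumption: the physical spoon width is known and its width in
    pixels has been measured in the same frame.
    """
    if spoon_width_px <= 0:
        raise ValueError("spoon_width_px must be positive")
    return spoon_width_mm / spoon_width_px


def prism_volume(length: float, width: float, height: float) -> float:
    """Volume of a rectangular box (assumed reading of 'Prism')."""
    return length * width * height


def ellipsoid_volume(length: float, width: float, height: float) -> float:
    """Volume of an ellipsoid given its full extents along three axes."""
    # Full extents are diameters, so halve them to get the semi-axes.
    return (4.0 / 3.0) * math.pi * (length / 2) * (width / 2) * (height / 2)


def hemisphere_volume(diameter: float) -> float:
    """Volume of a hemisphere given the diameter of its flat face."""
    radius = diameter / 2
    return (2.0 / 3.0) * math.pi * radius ** 3


def estimate_volume_ml(shape: str, length_px: float, width_px: float,
                       height_px: float, scale_mm_per_px: float) -> float:
    """Convert pixel extents to millimetres, then to a volume in mL."""
    length = length_px * scale_mm_per_px
    width = width_px * scale_mm_per_px
    height = height_px * scale_mm_per_px
    if shape == "prism":
        volume_mm3 = prism_volume(length, width, height)
    elif shape == "ellipsoid":
        volume_mm3 = ellipsoid_volume(length, width, height)
    elif shape == "hemisphere":
        # Assumption: use the mean of length and width as the diameter.
        volume_mm3 = hemisphere_volume((length + width) / 2)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


if __name__ == "__main__":
    # Hypothetical measurements for a single key frame.
    scale = mm_per_pixel(spoon_width_mm=40.0, spoon_width_px=200.0)
    for shape in ("prism", "ellipsoid", "hemisphere"):
        print(shape, round(estimate_volume_ml(shape, 150, 100, 50, scale), 2))
```

For the same extents, the three shapes give noticeably different volumes (an ellipsoid fills π/6, roughly 52%, of its bounding box), so the choice of shape prior matters as much as the accuracy of the per-frame length scale.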
Abstract
Monitoring dietary intake is a crucial aspect of promoting healthy living. In recent years, advances in computer vision have facilitated dietary intake monitoring through the use of images and depth cameras. However, current state-of-the-art image-based food portion estimation algorithms assume that users photograph their meals once or twice, which can be inconvenient and can fail to capture food items that are not visible from a top-down perspective, such as ingredients submerged in a stew. To address these limitations, we introduce a solution that uses stationary user-facing cameras to track food items on utensils and requires no change of camera perspective after installation. The shallow depth of utensils provides a more favorable angle for capturing food items, and tracking them on the utensil's surface offers a significantly more accurate estimation of dietary intake without the need for post-meal image capture. The system can reliably estimate the nutritional content of liquid-solid heterogeneous mixtures such as soups and stews. Through a series of experiments, we demonstrate the potential of our method as a non-invasive, user-friendly, and highly accurate dietary intake monitoring tool.
