How Much You Ate? Food Portion Estimation on Spoons

Aaryam Sharma, Chris Czarnecki, Yuhao Chen, Pengcheng Xi, Linlin Xu, Alexander Wong

TL;DR

This work presents a utensil-centered, stationary front-facing camera approach for dietary portion estimation that tracks food on utensils through a three-stage pipeline: segmentation (via HQ-SAM with Grounding DINO or XMem VOS), key-frame detection, and volume estimation using geometric shapes (Prism, Ellipsoid, Hemisphere). It introduces a Spurious Segmentation Filtering mechanism and demonstrates that Ellipsoid fitting with XMem-based VOS and filtering achieves strong performance on ten rice-on-spoon videos with ground-truth volumes, underscoring the method's potential for non-invasive, accurate monitoring of liquid-solid meals like soups and stews. The study also discusses the trade-offs between independent-frame segmentation and video-object segmentation, and emphasizes practical considerations such as frame sampling, per-frame length mapping, and utensil-geometry priors. Overall, the approach offers a promising pathway toward accessible, front-end dietary monitoring suitable for aging-in-place and home-health contexts.

Abstract

Monitoring dietary intake is a crucial aspect of promoting healthy living. In recent years, advances in computer vision technology have facilitated dietary intake monitoring through the use of images and depth cameras. However, the current state-of-the-art image-based food portion estimation algorithms assume that users take images of their meals one or two times, which can be inconvenient and fail to capture food items that are not visible from a top-down perspective, such as ingredients submerged in a stew. To address these limitations, we introduce an innovative solution that utilizes stationary user-facing cameras to track food items on utensils, not requiring any change of camera perspective after installation. The shallow depth of utensils provides a more favorable angle for capturing food items, and tracking them on the utensil's surface offers a significantly more accurate estimation of dietary intake without the need for post-meal image capture. The system is reliable for estimation of nutritional content of liquid-solid heterogeneous mixtures such as soups and stews. Through a series of experiments, we demonstrate the exceptional potential of our method as a non-invasive, user-friendly, and highly accurate dietary intake monitoring tool.

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Summary of the pipeline. Videos are processed either by Grounding DINO and HQ-SAM or by a VOS model (XMem). VOS requires single-frame priming, which is done using a single segmentation instance from HQ-SAM. Segmented frames are then filtered by Algorithm 1 (Spurious Segmentation Filtering), and three different methods of volumetric estimation are evaluated.
  • Figure 2: Examples of misclassified segmentation masks. In (a) the bowl has been improperly segmented as food. In (b) the hand has been improperly segmented as food. In (c) the independent-frame segmentation method has mistakenly classified the person as food!
  • Figure 3: The different shape fittings. (a) shows the prism fit, which uses an assumed average depth of $3.81$ cm for a spoon; the parallel faces of the prism are the food segmentation masks themselves. (b) shows the hemisphere fit based on the mask area. Neither (a) nor (b) models the curved bottom part of the spoon separately, because the excess volume included in the fitted shapes accounts for it. (c) shows the ellipsoid model, where the length and width of the segmentation are computed to find the ellipsoid's axis lengths. The volume of the bottom part of the spoon, assumed to be $5\ \text{cm}^3$, is then added.
  • Figure 4: Representation of the spoon in different frames of a video capture of a subject eating. Captures similar to this have been used to test the utensil tracking and volumetric estimation accuracy of our proposed approach.
  • Figure 5: Illustration of the reference length computation. The top curve of the spoon is used to find the first point from the tip of the spoon at which the slope of the curve exceeds 15 degrees. In this diagram, the angle in blue is 30 degrees.
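
The geometric steps described in the captions of Figures 3 and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. Only the 3.81 cm depth, the 5 cm^3 spoon-bottom volume, and the 15-degree threshold come from the captions. Everything else is an assumption made for illustration: the function names, the reading of the mask as a semicircular side profile for the hemisphere, the use of the spoon depth as the ellipsoid's third axis, the choice of a full ellipsoid, and the representation of the spoon's top curve as a list of (x, y) points ordered from the tip.

```python
import math

SPOON_DEPTH_CM = 3.81    # assumed average spoon depth (Figure 3 caption)
SPOON_BOTTOM_CM3 = 5.0   # assumed volume of the spoon's bottom part (Figure 3 caption)
SLOPE_THRESHOLD_DEG = 15.0  # slope threshold (Figure 5 caption)


def prism_volume(mask_area_cm2, depth_cm=SPOON_DEPTH_CM):
    """Prism fit: extrude the mask area through the assumed spoon depth."""
    return mask_area_cm2 * depth_cm


def hemisphere_volume(mask_area_cm2):
    """Hemisphere fit from the mask area.

    Assumption: the mask is the semicircular side profile of the
    hemisphere, so area = pi * r^2 / 2.
    """
    radius = math.sqrt(2.0 * mask_area_cm2 / math.pi)
    return (2.0 / 3.0) * math.pi * radius ** 3


def ellipsoid_volume(length_cm, width_cm, depth_cm=SPOON_DEPTH_CM,
                     bottom_cm3=SPOON_BOTTOM_CM3):
    """Ellipsoid fit plus the assumed spoon-bottom volume.

    Assumption: a full ellipsoid whose three axis lengths are the mask
    length, the mask width, and the spoon depth.
    """
    a, b, c = length_cm / 2.0, width_cm / 2.0, depth_cm / 2.0
    return (4.0 / 3.0) * math.pi * a * b * c + bottom_cm3


def first_steep_index(points, threshold_deg=SLOPE_THRESHOLD_DEG):
    """Index of the first point, walking from the tip, where the slope of
    the next segment exceeds the threshold. Returns None if none does.

    points: list of (x, y) tuples ordered from the spoon tip (hypothetical
    input format).
    """
    for i in range(len(points) - 1):
        dx = points[i + 1][0] - points[i][0]
        dy = points[i + 1][1] - points[i][1]
        angle = math.degrees(math.atan2(abs(dy), abs(dx)))
        if angle > threshold_deg:
            return i
    return None


if __name__ == "__main__":
    curve = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2), (3.0, 1.0)]
    print(first_steep_index(curve))
    print(prism_volume(2.0), hemisphere_volume(2.0), ellipsoid_volume(4.0, 1.5))
```

In this sketch all lengths are in centimetres, so pixel measurements from a segmentation mask would first need to be converted using the reference length. How the paper performs that conversion is not specified in the captions above.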