VolETA: One- and Few-shot Food Volume Estimation

Ahmad AlMughrabi, Umair Haroon, Ricardo Marques, Petia Radeva

TL;DR

VolETA addresses the challenge of estimating food volume from casual RGBD imagery by leveraging a hybrid one- and few-shot 3D reconstruction pipeline. It combines keyframe selection, PixSfM-based camera pose estimation, SAM-based reference segmentation, XMem++ tracking, and NeuS2 neural surface reconstruction to generate scaled 3D food meshes; subsequent scaling refinement uses MeshLab measurements and depth cues. On the MTF dataset, VolETA achieves a mean MAPE of 10.97% with robust shape accuracy, demonstrating resilience to occlusions and variable lighting. The approach enables practical, automated volumetric nutrition assessment from limited input, with potential impact on dietary monitoring and computational nutrition research.

Abstract

Accurate food volume estimation is essential for dietary assessment, nutritional tracking, and portion control applications. We present VolETA, a methodology for estimating food volume using 3D generative techniques. Our approach creates a scaled 3D mesh of food objects from one or a few RGBD images. In the few-shot setting, we start by selecting keyframes based on the RGB images and then segment the reference object in the RGB images using XMem++. Simultaneously, camera poses are estimated and refined using PixSfM. The segmented food images, reference objects, and camera poses are combined into a data model suitable for NeuS2. The reference and food objects are reconstructed as independent meshes, and scaling factors are determined in MeshLab from the reference object. Depth information is then used to fine-tune the scaling factors by estimating the potential volume range. The fine-tuned scaling factors are applied to the cleaned food meshes for accurate volume measurements. For one-shot food volume estimation, we instead feed a segmented RGB image to the One-2-3-45 model, which produces a mesh; we then apply the obtained scaling factors to the cleaned food mesh to measure its volume. Our experiments show that our method handles occlusions, varying lighting conditions, and complex food geometries, achieving 10.97% MAPE on the MTF dataset. This approach improves the precision of volume assessment and contributes to computational nutrition and dietary monitoring.
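The abstract reports accuracy as a mean absolute percentage error (MAPE). As a reminder of what that metric measures, here is a minimal sketch of the standard MAPE formula. The function name and the sample volumes are hypothetical illustrations, not values or code from the paper.

```python
def mape(predicted, ground_truth):
    """Mean absolute percentage error, returned in percent.

    Standard definition: the mean of |predicted - truth| / |truth| over all items.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - g) / abs(g) for p, g in zip(predicted, ground_truth)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical volumes in cubic centimeters (illustration only).
estimated_volumes = [110.0, 90.0]
true_volumes = [100.0, 100.0]
print(f"MAPE: {mape(estimated_volumes, true_volumes):.2f}%")
```

Because each error is normalized by the true volume, small and large food items contribute equally to the score.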

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Our few-shot approach for estimating food volume. (a) The pipeline takes RGBD images ($\mathcal{I}^D$) and food object masks as input. We start by selecting keyframes based on the RGB images, removing blurry and highly overlapping images, which results in the keyframes ($I^K$). Then, (b) we use PixSfM to estimate camera poses ($C$). Simultaneously, we segment the reference object using SAM with a segmentation prompt provided by a user. We then use the XMem++ memory-tracking method to produce reference object masks for all frames from the reference object mask and the RGB images. After that, we apply a binary image segmentation method to the RGB images ($I^K$), reference object masks ($M_r$), and food object masks ($M_f$), resulting in RGBA images ($I^R_r$). We then transform the RGBA images and poses to generate meaningful metadata and create the modeled data ($D_m$). Next, (c) we input the modeled data into NeuS2 to reconstruct colored meshes for the reference ($R_r$) and food objects ($R_f$). To ensure accuracy, we use "Remove Isolated Pieces" with diameter thresholding to clean up the meshes and remove small isolated pieces that do not belong to the reference or food mesh, resulting in ($\{R^C_r, R^C_f\}$). Finally, we manually identify the scaling factor ($S$) using the reference mesh via MeshLab. We fine-tune the scaling factor using depth information and the food masks, and then apply the fine-tuned scaling factor ($S_f$) to the cleaned food mesh to generate a scaled food mesh ($R^F_f$) in meters.
  • Figure 2: We manually measure the scaling factor using MeshLab's Measuring tool. We measure multiple blocks in the reference object mesh and then take the average of the block lengths, $l_{avg}$.
  • Figure 3: Our one-shot food volume estimation approach. We begin with a food-segmented image ($I^R_f$) and use the One-2-3-45 model to generate a mesh ($R_f$). Next, we remove the isolated pieces that are smaller than 5% of the size of $R_f$, resulting in a cleaned food mesh ($R^C_f$). We then choose a scaling factor ($S_f$) based on the depth information. Finally, we apply the chosen scaling factor to $R^C_f$ to obtain a scaled mesh ($R^F_f$), from which we extract the volume.
  • Figure 4: Comparison of our reconstructions with the ground truth on the MTF dataset. Each scene shows our reconstruction (left) and the ground truth (right).
  • Figure 5: Quantitative results for the number of frames before and after the keyframe selection phase. Our approach uses only 34.8% of the data.
  • ...and 2 more figures
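
The captions of Figures 1 to 3 describe scaling a cleaned mesh by a factor derived from a reference object and then extracting its volume. The sketch below illustrates that idea in plain Python under stated assumptions: the scaling factor is taken as a known real-world block length divided by the average measured length $l_{avg}$ (the real length and this ratio are assumptions, since the captions give only $l_{avg}$), and the volume of a closed, consistently oriented triangle mesh is computed with the standard signed-tetrahedron sum. All function names, the unit-cube mesh, and the sample measurements are hypothetical; the paper itself uses MeshLab for these steps.

```python
def scaling_factor(measured_lengths, real_length):
    """Assumed form: real-world length divided by the average measured length."""
    l_avg = sum(measured_lengths) / len(measured_lengths)
    return real_length / l_avg


def scale_mesh(vertices, s):
    """Uniformly scale every vertex by the factor s."""
    return [(x * s, y * s, z * s) for x, y, z in vertices]


def mesh_volume(vertices, faces):
    """Volume of a closed, consistently oriented triangle mesh.

    Sums the signed volumes of the tetrahedra formed by each face and the origin.
    """
    total = 0.0
    for i, j, k in faces:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # Scalar triple product a . (b x c)
        total += (ax * (by * cz - bz * cy)
                  + ay * (bz * cx - bx * cz)
                  + az * (bx * cy - by * cx))
    return abs(total) / 6.0


# Hypothetical stand-in for a cleaned food mesh: a unit cube in mesh units.
CUBE_VERTICES = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0), (0.0, 1.0, 1.0),
]
CUBE_FACES = [
    (0, 3, 2), (0, 2, 1),  # bottom
    (4, 5, 6), (4, 6, 7),  # top
    (0, 1, 5), (0, 5, 4),  # front
    (3, 7, 6), (3, 6, 2),  # back
    (0, 4, 7), (0, 7, 3),  # left
    (1, 2, 6), (1, 6, 5),  # right
]

# Hypothetical measurements of reference blocks (mesh units) and an assumed
# real block length of 0.05 m.
s = scaling_factor([0.98, 1.02, 1.00], real_length=0.05)
scaled_vertices = scale_mesh(CUBE_VERTICES, s)
print(f"scaling factor: {s:.4f}")
print(f"scaled volume:  {mesh_volume(scaled_vertices, CUBE_FACES):.6e} m^3")
```

Since the scaling is uniform, the volume changes by the cube of the scaling factor, so a small error in the measured reference length is amplified roughly threefold in the relative volume error. This is one reason a depth-based refinement of the scaling factor can matter.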