Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images

Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He

TL;DR

This work tackles the challenge of estimating food portions from a single monocular image in realistic dining scenes, where scale is ambiguous and multiple items co-exist. It introduces Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images as a benchmark, emphasizing implicit scale cues from plates and utensils instead of explicit references. The authors compare three methodology classes (Pixel-Space Heuristic Scaling, Scene-Level Geometric Prior Scaling, and Metric Depth–Driven Multi-Stage Scaling, or MDMS) and show that geometry-aware, depth-guided reconstruction with MDMS yields the best volume accuracy (MAPE ≈ 0.21) and geometric fidelity (L1 Chamfer ≈ 5.7), outperforming appearance-based baselines. The results demonstrate the practical value of explicit 3D reconstruction for robust food portion estimation, and the benchmark is intended to drive future geometry-aware methods for real-world dietary assessment.

Abstract

We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
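
The two metrics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the benchmark's evaluation code: the function names `mape` and `chamfer_l1` are hypothetical, and the Chamfer variant shown (unsquared Euclidean nearest-neighbour distances, averaged over both directions) is only one common convention. The exact definition, point sampling, and units used by the benchmark are not stated in this summary.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mape(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute percentage error, returned as a fraction (0.21 means 21%)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_l1(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance with unsquared distances (assumed convention)."""
    if not a or not b:
        raise ValueError("point sets must be non-empty")
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


if __name__ == "__main__":
    # Made-up volumes for two food items (predicted vs. ground truth).
    print(mape([120.0, 80.0], [100.0, 100.0]))

    # Two tiny made-up point sets, offset by one unit along z.
    recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(chamfer_l1(recon, truth))
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for illustration; a real evaluation on dense meshes would typically use a spatial index such as a k-d tree.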

Paper Structure (8 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Implicit-scale 3D reconstruction of multi-food scenes. Users are provided with a single image of a realistic multi-food eating scenario, including multiple food items, plates, and utensils, without an explicit scale reference.
  • Figure 2: Cropped examples from our benchmark dataset. Original images include a wider field of view containing utensils and surrounding context.