Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images
Diego Eustachio Farchione, Ramzi Idoughi, Peter Wonka
TL;DR
This work addresses robust estimation of object volume $V$ and surface area $A$ from multi-view images with an end-to-end framework that fuses 3D point-cloud geometry with 2D DINOv3 features. A lightweight fusion decoder replaces heavy mesh post-processing and outputs $V$ and $A$ together with calibrated uncertainties, using a loss based on the Gaussian negative log-likelihood (NLL), which keeps predictions accurate under sparse or noisy views. The model is pretrained on a large synthetic corpus (Objaverse) and fine-tuned on domain-specific datasets (synthetic corals, THuman, MetaFood3D), achieving strong cross-domain performance on corals, foods, and humans, with faster inference than traditional reconstruction pipelines. The method generalizes well with limited real data and supports near-real-time deployment, with practical applications in reef monitoring, dietary assessment, and anthropometry. Absolute measurements still require an external scale cue.
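To make the uncertainty-aware objective concrete, here is a minimal sketch of a standard Gaussian NLL for one scalar prediction, summed over volume and surface area. It assumes the decoder emits a mean and a log-variance per quantity; the dictionary keys and the function names `gaussian_nll` and `volume_area_loss` are hypothetical, and the paper's actual loss may add weighting or other terms not shown here.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    Predicting the log-variance (an assumption of this sketch) keeps the
    variance positive without any explicit constraint.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (target - mean) ** 2 / var)


def volume_area_loss(pred, target):
    """Sum of the per-quantity NLL terms for volume V and surface area A.

    `pred` and `target` are plain dicts with hypothetical keys; equal
    weighting of the two terms is an illustrative choice.
    """
    return (
        gaussian_nll(target["V"], pred["V_mean"], pred["V_log_var"])
        + gaussian_nll(target["A"], pred["A_mean"], pred["A_log_var"])
    )
```

For a fixed error, the term is smallest when the predicted variance matches the squared error, so the model is penalized both for overconfidence and for inflating its uncertainty. This is what ties the predicted uncertainty to the actual prediction error.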
Abstract
Accurate estimation of object volume and surface area from visual data is an open challenge with broad implications across various domains. We propose a unified framework that predicts volumetric and surface metrics directly from a set of 2D multi-view images. Our approach first generates a point cloud from the captured multi-view images using recent 3D reconstruction techniques, while a parallel 2D encoder aggregates view-aligned features. A fusion module then aligns and merges 3D geometry with 2D visual embeddings, followed by a graph-based decoder that regresses volume, surface area, and their corresponding uncertainties. The proposed architecture remains robust to sparse or noisy data. We evaluate the framework across multiple application domains: corals, where precise geometric measurements support growth monitoring; food items, where volume prediction relates to dietary tracking and portion analysis; and human bodies, where volumetric cues are crucial for anthropometric and medical applications. Experimental results demonstrate the reliable performance of our framework across diverse scenarios, highlighting its versatility and adaptability. Furthermore, by coupling 3D reconstruction with neural regression and 2D features, our model provides a scalable and fast solution for quantitative shape analysis from visual data.
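For reference, the two regression targets have exact closed forms when a closed, consistently oriented triangle mesh is available. The sketch below is not the proposed pipeline, which regresses these values from point clouds and image features in order to avoid mesh post-processing. It only illustrates what $V$ and $A$ mean and why absolute values depend on scale. The function name `mesh_volume_and_area` and the list-based mesh layout are assumptions made for illustration.

```python
import math


def _cross(u, v):
    """Cross product of two 3-vectors given as tuples."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def mesh_volume_and_area(vertices, faces):
    """Volume and surface area of a closed, consistently oriented triangle mesh.

    Volume is the sum of signed tetrahedra spanned by the origin and each
    face (divergence theorem); area is the sum of the triangle areas.
    """
    volume = 0.0
    area = 0.0
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        volume += _dot(a, _cross(b, c)) / 6.0
        e1 = tuple(b[n] - a[n] for n in range(3))
        e2 = tuple(c[n] - a[n] for n in range(3))
        normal = _cross(e1, e2)
        area += 0.5 * math.sqrt(_dot(normal, normal))
    # abs() makes the result independent of inward vs. outward orientation.
    return abs(volume), area


# Unit right tetrahedron, used here purely as a toy example.
tetra_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
tetra_faces = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
```

Scaling every vertex by a factor $s$ multiplies the volume by $s^3$ and the area by $s^2$. A reconstruction that is only defined up to scale therefore cannot give absolute measurements without an external scale cue.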
