Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images

Diego Eustachio Farchione, Ramzi Idoughi, Peter Wonka

TL;DR

This work estimates object volume $V$ and surface area $A$ from multi-view images with an end-to-end framework that fuses 3D point-cloud geometry with 2D DINOv3 features. It replaces heavy mesh post-processing with a lightweight fusion decoder that predicts $V$ and $A$ together with calibrated uncertainties, trained with a Gaussian NLL-based loss, which enables accurate predictions under sparse or noisy views. The model is pretrained on a large synthetic corpus (Objaverse) and fine-tuned on domain-specific datasets (synthetic corals, THuman, MetaFood3D), achieving strong cross-domain performance on corals, foods, and humans with faster inference than traditional reconstruction pipelines. The method generalizes robustly with limited real data, supports near-real-time deployment, and targets practical applications in reef monitoring, dietary assessment, and anthropometry; absolute measurements still require external scale cues.
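As a rough illustration of how a Gaussian NLL loss can yield an uncertainty alongside each prediction, the sketch below scores a predicted mean and log-variance against a target. It is a minimal, assumed form of the standard Gaussian negative log-likelihood, not the authors' exact loss; the function name `gaussian_nll` and all numbers are hypothetical.

```python
import math

def gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood of `target`, up to an additive constant.

    mean    : predicted value (e.g. volume V or surface area A)
    log_var : predicted log-variance, s = log(sigma^2)
    """
    # 0.5 * [ s + (y - mu)^2 / exp(s) ]: the squared residual is down-weighted
    # when the model declares a high variance, while the `s` term penalizes
    # declaring more variance than the error justifies.
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))

# Hypothetical numbers: prediction 1.2, ground truth 1.0 (error of 0.2).
overconfident = gaussian_nll(1.2, math.log(0.01), 1.0)  # claims sigma = 0.1
matched = gaussian_nll(1.2, math.log(0.04), 1.0)        # claims sigma = 0.2
vague = gaussian_nll(1.2, math.log(1.0), 1.0)           # claims sigma = 1.0
```

For a fixed error, the loss is lowest when the declared variance matches the squared error, so both overconfident and overly vague predictions are penalized. This is the mechanism that pushes the predicted uncertainty toward calibration.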

Abstract

Accurate estimation of object volume and surface area from visual data is an open challenge with broad implications across various domains. We propose a unified framework that predicts volumetric and surface metrics directly from a set of 2D multi-view images. Our approach first generates a point cloud from the captured multi-view images using recent 3D reconstruction techniques, while a parallel 2D encoder aggregates view-aligned features. A fusion module then aligns and merges 3D geometry with 2D visual embeddings, followed by a graph-based decoder that regresses volume, surface area, and their corresponding uncertainties. This proposed architecture maintains robustness against sparse or noisy data. We evaluate the framework across multiple application domains: corals, where precise geometric measurements support growth monitoring; food items, where volume prediction relates to dietary tracking and portion analysis; and human bodies, where volumetric cues are crucial for anthropometric and medical applications. Experimental results demonstrate the reliable performance of our framework across diverse scenarios, highlighting its versatility and adaptability. Furthermore, by coupling 3D reconstruction with neural regression and 2D features, our model provides a scalable and fast solution for quantitative shape analysis from visual data.

Paper Structure

This paper contains 37 sections, 8 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Teaser. Our end-to-end pipeline for robust volume and surface area estimation from multi-view images, demonstrated on corals from the CoralVOS dataset [ziqiang2023coralvos]. Using only five top-view RGB images from a monocular video, our model predicts normalized volume and surface area with their corresponding confidence, showcasing its potential for efficient, real-world applications.
  • Figure 2: From masked multi-view images from meshes (or original images), a 3D reconstructor like MapAnything produces a fused point cloud with per-point confidence, while a frozen DINOv3 encoder extracts view-aligned 2D features. In the fusion decoder, a lightweight DGCNN summarizes the point cloud, followed by a max-pooling layer. In the 2D branch, per-view features pass through a small FC block and are mean-max pooled across views. The fusion decoder concatenates the 3D and 2D descriptors for each target and feeds a small FC regressor that outputs Volume and Surface along with their Confidence (log-variance). A composite loss (Gaussian NLL in linear/log domains + MAPE) trains the heads end-to-end, yielding single-pass volume/surface estimates without the need to reconstruct watertight meshes or apply heavy post-processing.
  • Figure 3: Representative samples across datasets. Top: food items from the MetaFood dataset. Middle: human subjects from THuman2.1. Bottom: synthetic coral specimens generated with Infinigen and ManifoldPlus. These examples highlight the geometric and visual diversity of the domains on which our framework performs unified volume and surface estimation.
  • Figure 4: Watertight post-processing and low-volume sensitivity. Top: original reconstructed mesh. Bottom: watertight mesh. Enforcing watertightness introduces artifacts (spurious bridges and corrugations) and biases the volume downward on this sample. Because the object has relatively low volume, even small geometric artifacts, whether due to watertight conversion or ordinary reconstruction noise, can disproportionately increase per-sample MAPE, inflating the overall mean and widening the spread around the median.
  • Figure S1: Example of segmentation inconsistencies in the MetaFood dataset. While some masks correctly isolate the full object (right), others segment only the food portion inside a container (left). This ambiguity arises because, in certain classes, the ground-truth volume refers to both the container and its contents, while the mask includes only the food. Manual quality assurance (QA) was therefore required to ensure consistency across samples. Representative segmentation examples were generated using Grounded-SAM2.
  • ...and 5 more figures
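
The pooling-and-concatenation step described in the Figure 2 caption can be sketched in a few lines. This is a toy illustration under stated assumptions: the DGCNN, the FC blocks, and the regressor are omitted, "mean-max pooled" is assumed to mean concatenating the mean-pooled and max-pooled vectors, and the helper names (`max_pool`, `mean_pool`, `fuse`) and feature values are hypothetical.

```python
def max_pool(rows):
    """Column-wise maximum over a list of equal-length feature vectors."""
    return [max(col) for col in zip(*rows)]

def mean_pool(rows):
    """Column-wise mean over a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fuse(point_feats, view_feats):
    """Build one global descriptor from per-point and per-view features."""
    # 3D branch: max pooling over points gives an order-invariant summary.
    desc_3d = max_pool(point_feats)
    # 2D branch: mean and max pooling over views, concatenated (assumed form).
    desc_2d = mean_pool(view_feats) + max_pool(view_feats)
    # Fusion: concatenate the 3D and 2D descriptors for the regressor.
    return desc_3d + desc_2d

# Toy inputs: 3 points and 2 views, each with 2-dimensional features.
point_feats = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
view_feats = [[1.0, 2.0], [3.0, 0.0]]
descriptor = fuse(point_feats, view_feats)
```

Because both pooling operations are symmetric, the descriptor has a fixed length regardless of how many points or views are supplied, and it does not depend on their order. That property is what lets a single small regressor handle inputs with a varying number of views.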