UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu

TL;DR

UniScale is a unified, scale-aware multi-view 3D reconstruction framework for robotic applications. It flexibly integrates geometric priors through a modular, semantically informed design, enabling robust, metric-aware 3D reconstruction within a single model.

Abstract

We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, accurately extracting environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale recovers the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch; it leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotics teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
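
To make the relationship between the scale-invariant outputs and the metric scale concrete, the following is a minimal sketch, not the authors' implementation. It assumes the predicted scene-level scale is a single positive scalar that multiplies the scale-invariant depths and point maps; the function name `to_metric` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def to_metric(
    points: Sequence[Point],
    depths: Sequence[float],
    scale: float,
) -> Tuple[List[Point], List[float]]:
    """Rescale scale-invariant geometry with one scene-level scale factor.

    Assumption (not confirmed by the source): metric geometry is obtained by
    multiplying scale-invariant points and depths by a single positive scalar.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    # A uniform scaling of the scene multiplies every 3D coordinate by `scale`.
    metric_points = [(scale * x, scale * y, scale * z) for x, y, z in points]
    # Depths are distances along the viewing direction, so they scale the same way.
    metric_depths = [scale * d for d in depths]
    return metric_points, metric_depths


# Toy scale-invariant predictions (hypothetical values).
points = [(0.0, 0.0, 1.0), (0.5, -0.25, 2.0)]
depths = [1.0, 2.0]

metric_points, metric_depths = to_metric(points, depths, scale=4.0)
print(metric_points)
print(metric_depths)
```

Under the same assumption, camera translations would be multiplied by the same factor, while rotations and intrinsics are unaffected by a uniform rescaling of the scene.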

Paper Structure

This paper contains 37 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: UniScale Overview. Given a set of images with optional camera intrinsic and extrinsic information, UniScale generates depth and point maps, the metric scale, and auxiliary camera information, all of which may be used in 3D reconstruction for downstream robotic tasks.
  • Figure 2: (a) Overview of the UniScale architecture and (b) Architecture of the Scale Head. The model combines global contextual information from class tokens, camera intrinsics and extrinsics encoded in camera tokens, and image features from aggregated patch tokens to predict the scene-level scale value.
  • Figure 3: Comparison between UniScale and other SOTA methods on the modified dense-$N$-view benchmark. UniScale achieves better or comparable dense multi-view reconstruction as the number of input views varies from 2 to 50.
  • Figure 4: Qualitative Comparison - Oxford Spires Dataset [tao2025spires]
  • Figure 5: Qualitative Reconstruction - EuRoC MAV Dataset [Burri:etal:IJRR2016]
  • ...and 1 more figures