LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

Ziqi Lu, Heng Yang, Danfei Xu, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

TL;DR

LoRA3D addresses the scarcity of labeled 3D data for improving pre-trained 3D geometric foundation models with a fast self-calibration pipeline that specializes a model to a target scene using only sparse RGB views. It combines robust, confidence-calibrated multi-view optimization, which produces high-quality pseudo labels, with lightweight low-rank adaptation (LoRA) fine-tuning, so that adaptation completes within 5 minutes on a single GPU. The approach delivers up to 88% improvement across 161 scenes on 3D reconstruction, multi-view pose estimation, and novel-view rendering, and also generalizes to MASt3R. It offers a practical path to deploying robust 3D perception in the wild without manual labels or external priors, which matters for rapid scene-specific adaptation in robotics and AR/VR applications.

Abstract

Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.
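
The idea of re-weighting prediction confidence during robust optimization, then keeping only high-confidence points as pseudo labels, can be illustrated with a toy sketch. The abstract does not give the paper's actual weight update, so the code below substitutes a generic iteratively re-weighted least squares (IRLS) loop with a Geman-McClure weight. The function names, the scalar "depth" setting, the scale `mu`, and the 0.5 threshold are all assumptions made for illustration.

```python
def geman_mcclure_weight(residual, mu):
    """IRLS weight of the Geman-McClure loss: close to 1 for small
    residuals and close to 0 for residuals much larger than sqrt(mu)."""
    return (mu / (mu + residual ** 2)) ** 2


def calibrate(predictions, confidences, mu=0.25, iters=10):
    """Fuse redundant predictions of one quantity.

    Returns the fused estimate and a calibrated confidence per prediction,
    defined here (as an assumption) as prior confidence times robust weight.
    """
    weights = list(confidences)  # start from the model's own confidence
    estimate = 0.0
    for _ in range(iters):
        # Weighted least-squares estimate under the current weights.
        estimate = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
        # Down-weight predictions that disagree with the consensus.
        weights = [c * geman_mcclure_weight(p - estimate, mu)
                   for c, p in zip(confidences, predictions)]
    return estimate, weights


# Hypothetical depths (metres) of one 3D point seen from four image pairs.
# The last prediction is an outlier that carries the highest raw confidence.
depths = [2.00, 2.02, 1.98, 3.50]
confs = [1.0, 1.0, 1.0, 1.5]

estimate, calibrated = calibrate(depths, confs)
keep = [w > 0.5 for w in calibrated]  # hypothetical pseudo-label threshold
```

In this toy setting the overconfident outlier ends up with a calibrated confidence far below the three consistent predictions, so thresholding excludes it from the pseudo labels while the fused estimate stays close to the consensus depth.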

Paper Structure

This paper contains 32 sections, 15 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Given sparse RGB images, our self-calibration pipeline efficiently specializes a pre-trained 3D foundation model to a target scene to improve its performance for a variety of 3D vision tasks.
  • Figure 2: Overview of our self-calibration pipeline. (a) Predict: We pair sparse input RGB images and use the pre-trained 3D foundation model to predict per-pair point maps and confidence maps. (b) Robust Global Optimization: We apply robust optimization techniques to concurrently refine multi-view point predictions and calibrate prediction confidence. (c) Confidence-Based Pseudo-Labeling: Refined point maps with high calibrated confidence are used to generate pseudo-labels on calibration views. (d) LoRA Fine-Tuning: Using the pseudo-labeled data, we efficiently fine-tune the pre-trained model with LoRA. While the figure illustrates our method using DUSt3R, our approach generalizes to other 3D foundation models.
  • Figure 3: Pre-trained DUSt3R's (b) prediction confidence and (c) error map on (a) an example image pair: In cases of limited visual overlap, DUSt3R may produce overconfident predictions (★). Our robust multi-view alignment method effectively reduces this overconfidence, maintaining high confidence for accurate predictions (+, ×) and low confidence for outlier predictions (•).
  • Figure 4: Pseudo-labeling with (a) calibrated confidence, which is a good measure of the (b) point estimation accuracy. We select high-calibrated-confidence point predictions as pseudo labels (d) for DUSt3R fine-tuning.
  • Figure 5: What is the best DUSt3R fine-tuning strategy? We plot the mean prediction errors on test images against the number of trainable parameters for various fine-tuning options on an example test scene (Replica "office0"). We found that adapting all attention weights with rank-16 LoRA (i.e., ★) achieves the best trade-off between performance and efficiency on most test scenes.
  • ...and 7 more figures
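
To make the rank-16 LoRA setting mentioned in Figure 5 concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. It follows the standard LoRA formulation (a frozen weight plus a scaled low-rank update); it is not the authors' implementation. The class name `LoRALinear`, the helper `matvec`, and the 64-dimensional layer size are hypothetical and chosen only to keep the example small; they do not reflect DUSt3R's actual layer dimensions.

```python
import random


def matvec(M, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha=None, seed=0):
        rng = random.Random(seed)
        self.W = W  # frozen pre-trained weight, shape (d_out, d_in)
        d_out, d_in = len(W), len(W[0])
        self.rank = rank
        self.scale = (alpha if alpha is not None else rank) / rank
        # A starts with small random values and B starts at zero, so the
        # adapted layer initially reproduces the pre-trained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        """Only A and B are trained: r * (d_in + d_out) parameters."""
        d_out, d_in = len(self.W), len(self.W[0])
        return self.rank * (d_in + d_out)


# Toy usage with a hypothetical 64 x 64 attention projection.
d = 64
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

layer = LoRALinear(W, rank=16)
full_params = d * d                      # 4096 if W itself were fine-tuned
lora_params = layer.trainable_params()   # 16 * (64 + 64) = 2048
```

Because B is initialized to zero, `layer.forward(x)` matches the frozen layer's output before any training step. The savings from the low-rank factorization grow with layer width, since the adapter scales with r * (d_in + d_out) rather than d_in * d_out.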