
Dark3R: Learning Structure from Motion in the Dark

Andrew Y Guo, Anagh Malik, SaiKiran Tedla, Yutong Dai, Yiqian Qin, Zach Salehe, Benjamin Attal, Sotiris Nousias, Kyros Kutulakos, David B. Lindell

Abstract

We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB -- a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher--student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy--clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson--Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R's predicted poses and a coarse-to-fine radiance field optimization procedure.
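
The Poisson--Gaussian synthesis mentioned above can be illustrated with a small standard-library sketch. It is not the authors' pipeline: it assumes pixel values are expressed in electrons, that shot noise is Poisson with variance equal to the mean, that read noise is zero-mean Gaussian, and that SNR in dB is defined as $20\log_{10}(\text{mean}/\text{std})$, which is one common convention and may differ from the paper's. The names `snr_db`, `sample_poisson`, and `add_noise` are hypothetical.

```python
import math
import random


def snr_db(signal_electrons: float, read_noise_std: float) -> float:
    """Per-pixel SNR in dB under an assumed Poisson-Gaussian model.

    Shot-noise variance is taken to equal the mean signal (in electrons);
    read noise is zero-mean Gaussian with standard deviation read_noise_std.
    """
    noise_std = math.sqrt(signal_electrons + read_noise_std ** 2)
    return 20.0 * math.log10(signal_electrons / noise_std)


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw one Poisson sample with Knuth's method (suitable for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def add_noise(clean_electrons, read_noise_std, rng):
    """Return a noisy copy of a list of clean pixel values (in electrons)."""
    return [
        sample_poisson(value, rng) + rng.gauss(0.0, read_noise_std)
        for value in clean_electrons
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [1.0] * 8  # a very dark patch: one electron per pixel on average
    print(add_noise(clean, read_noise_std=1.5, rng=rng))
    print(f"SNR: {snr_db(1.0, 1.5):.2f} dB")
```

Under these assumptions, a pixel collecting about one electron with read noise above roughly 1.2 electrons already falls below $-4$ dB, which gives a sense of how little signal the regime described in the abstract contains.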

Paper Structure (47 sections, 7 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Dark3R enables structure from motion and novel view synthesis from raw images captured in low-light conditions. (a) For this scene, we capture 500 images from varying viewpoints and show a subset along with their signal-to-noise ratios. Temporal sensor noise causes pronounced frame-to-frame color variations, visible in the bottom row, which further complicates the problem. (b) We apply Dark3R to these images to recover camera poses and the 3D scene geometry (we show a subset of the predicted poses). (c) Finally, we introduce a robust view-synthesis technique that leverages Dark3R’s predicted poses and a coarse-to-fine optimization strategy to reconstruct fine appearance details that are otherwise completely obscured by noise. Please refer to the project webpage for video results.
  • Figure 2: Existing hand-crafted and data-driven feature-matching pipelines such as SuperGlue [sarlin2020superglue] and MASt3R [leroy2024grounding] perform reliably under well-illuminated conditions (top row), but their performance degrades significantly when the image signal-to-noise ratio (SNR) drops below –3 dB (bottom row). In contrast, Dark3R robustly identifies corresponding points in both imaging regimes. For a randomly selected set of 20 putative matches, green lines denote correspondences whose symmetric epipolar distance (SED) is below two pixels and red lines denote those whose SED is above two pixels. The average SED over all matches is also reported. We compute SED using calibrated camera intrinsics and the essential matrix predicted from MASt3R correspondences on the high-SNR image pair.
  • Figure 3: Method overview. (a) Dark3R is trained using paired clean and noisy raw images, $(\mathbf{I}_\text{clean}^{(1)}, \mathbf{I}_\text{clean}^{(2)})$ and $(\mathbf{I}_\text{noisy}^{(1)}, \mathbf{I}_\text{noisy}^{(2)})$. The model is initialized from the weights of a pretrained MASt3R [leroy2024grounding] network and adapted to low-light conditions using low-rank adaptation [hu2022lora]. We fine-tune the encoder $\tilde{\mathcal{E}}$, decoder $\tilde{\mathcal{D}}$, and output head $\tilde{\mathcal{H}}$. We supervise training by minimizing the difference between MASt3R’s encoder features $\mathbf{F}_\mathcal{E}$, decoder features $\mathbf{F}_\mathcal{D}$, and correspondence map $\mathbf{C}$ from the clean pair and Dark3R’s predictions $\tilde{\mathbf{F}}_\mathcal{E}$, $\tilde{\mathbf{F}}_\mathcal{D}$, and $\tilde{\mathbf{C}}$ on the noisy pair. (b) After training, the predicted poses and depth maps from Dark3R enable view synthesis in the dark via a coarse-to-fine optimization process. The rendered novel views are passed through an image signal processor (ISP) to produce the final sRGB outputs.
  • Figure 4: Pose prediction assessment. We compare Dark3R (blue) against MASt3R-SfM [duisterhof2025mast3r] (gray) across four pose and depth metrics: relative pose error in translation (RPE tran.), relative pose error in rotation (RPE rot.), absolute relative depth error (AbsRel), and the accuracy threshold $\delta < 1.25$. Each curve shows mean performance with shaded regions indicating standard deviation across scenes. As image SNR decreases, Dark3R maintains lower pose and depth errors and higher reconstruction accuracy compared to MASt3R-SfM.
  • Figure 5: We compare point clouds and camera poses recovered by Dark3R with those estimated by MASt3R-SfM [duisterhof2025mast3r] on two low-light scenes. Dark3R produces more accurate geometry and camera trajectories that better align with a reference solution obtained by running COLMAP [schonberger2016structure] on well-exposed images captured from the same viewpoints as the noisy inputs.
  • ...and 7 more figures
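
The two depth metrics named in the Figure 4 caption, AbsRel and the $\delta < 1.25$ accuracy, can be sketched using their standard definitions from the depth-estimation literature. This is a minimal illustration, not the paper's evaluation code: it assumes predicted and reference depths are flat lists already expressed at the same scale (any alignment step the authors use is not reproduced), and the function names `abs_rel` and `delta_accuracy` are hypothetical.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with a valid (positive) reference depth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio max(pred/gt, gt/pred) is below threshold.

    Non-positive predictions on valid pixels are counted as misses.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 4.0, 3.0]
    reference = [1.0, 2.0, 2.0, 4.0]
    print(abs_rel(predicted, reference))         # relative error, lower is better
    print(delta_accuracy(predicted, reference))  # inlier fraction, higher is better
```

Lower AbsRel and higher $\delta$ accuracy both indicate better depth, which matches how the caption describes the comparison.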