
Self-Aligning Depth-regularized Radiance Fields for Asynchronous RGB-D Sequences

Yuxin Huang, Andong Yang, Zirui Wu, Yuantao Chen, Runyi Yang, Zhenxin Zhu, Chao Hou, Hao Zhao, Guyue Zhou

TL;DR

This work tackles learning depth-regularized neural radiance fields from asynchronous RGB-D sequences captured by UAVs, where the RGB and depth streams are not temporally synchronized. It introduces a time-pose function $\phi$ that maps timestamps to poses in $\mathrm{SE}(3)$, so that depth supervision can regularize large-scale NeRFs through a cascaded, fully differentiable architecture. The approach first bootstraps a large-scale RGB-only radiance field with parameters $\theta$, then jointly optimizes $\theta$ and $\phi$ with RGB-D supervision. Experiments on a synthetic dataset, Asynchronous Urban Scene (AUS), and on real-world drone data show improved novel-view rendering and depth estimation over baselines. The method targets city-scale scene reconstruction from asynchronous UAV data, with applications in photorealistic rendering and 3D mapping for autonomous aerial missions.

Abstract

It has been shown that learning radiance fields with depth rendering and depth supervision can effectively promote the quality and convergence of view synthesis. However, this paradigm requires input RGB-D sequences to be synchronized, hindering its usage in the UAV city modeling scenario. As there exists asynchrony between RGB images and depth images due to high-speed flight, we propose a novel time-pose function, which is an implicit network that maps timestamps to $\mathrm{SE}(3)$ elements. To simplify the training process, we also design a joint optimization scheme that learns the large-scale depth-regularized radiance fields and the time-pose function together. Our algorithm consists of three steps: (1) time-pose function fitting, (2) radiance field bootstrapping, (3) joint pose error compensation and radiance field refinement. In addition, we propose a large synthetic dataset with diverse controlled mismatches and ground truth to evaluate this new problem setting systematically. Through extensive experiments, we demonstrate that our method outperforms baselines without regularization. We also show qualitatively improved results on a real-world asynchronous RGB-D sequence captured by a drone. Code, data, and models will be made publicly available.
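To make the idea of a time-pose function concrete, the following is a minimal sketch, not the authors' implementation. It assumes a normalized timestamp in [0, 1], small dense 1-D feature grids at several resolutions (standing in for a hashed grid), a single linear layer as decoder, and a translation plus axis-angle pose parameterization. The names `encode_timestamp`, `time_pose`, and `make_random_model` are hypothetical, and the weights are random rather than fitted.

```python
import math
import random


def encode_timestamp(t, grids):
    """Look up a normalized timestamp t in [0, 1] at every grid resolution.

    Each grid is a list of (resolution + 1) feature vectors. Features are
    linearly interpolated between the two nearest nodes and concatenated.
    """
    feats = []
    for grid in grids:
        res = len(grid) - 1
        x = t * res
        i0 = min(int(x), res - 1)  # clamp so that t == 1.0 stays in range
        w = x - i0
        feats.extend((1.0 - w) * a + w * b
                     for a, b in zip(grid[i0], grid[i0 + 1]))
    return feats


def axis_angle_to_rotation(r):
    """Rodrigues' formula: axis-angle 3-vector -> 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in r))
    if theta < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in r)
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def time_pose(t, grids, head):
    """Hypothetical phi: timestamp -> 4x4 homogeneous pose matrix.

    Assumption: the decoder is one linear layer producing six numbers,
    three for translation followed by three for axis-angle rotation.
    """
    feats = encode_timestamp(t, grids)
    p = [sum(w * f for w, f in zip(row, feats)) for row in head]
    rot = axis_angle_to_rotation(p[3:])
    return [rot[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def make_random_model(resolutions=(4, 16), feature_dim=2, seed=0):
    """Random (untrained) grids and decoder weights, for illustration only."""
    rng = random.Random(seed)
    grids = [[[rng.uniform(-1, 1) for _ in range(feature_dim)]
              for _ in range(res + 1)] for res in resolutions]
    n_feats = feature_dim * len(resolutions)
    head = [[rng.uniform(-1, 1) for _ in range(n_feats)] for _ in range(6)]
    return grids, head
```

Because the rotation block comes from Rodrigues' formula, every queried pose is a valid rigid transform regardless of the weights. A depth frame taken at a timestamp that has no matching RGB frame can therefore still be assigned a pose by calling `time_pose` at that timestamp.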

Paper Structure (13 sections, 9 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: To learn a depth-regularized radiance field using (a) asynchronous RGB-D sequences, we propose a (b) time-pose function to map from timestamp to camera pose. For a (c) novel view, our method can render a better (d) depth map than (e) Mega-NeRF.
  • Figure 2: Method Pipeline. The time-pose function is modeled using a 1-D multi-resolution hash grid with direct and speed losses. After bootstrapping the scene representation networks with pure RGB signals, the predicted depth sensor poses are used for jointly optimizing the NeRFs' parameters $\theta$. At each timestamp ($t_i$ from RGB sequence or $t_j$ from depth sequence), only one modality of sensor signals is provided, thus only one loss term is activated (shown on the right).
  • Figure 3: Our implementation of the Time-Pose Function with a multi-resolution hash grid. The blue and orange networks have different resolutions.
  • Figure 4: Three-step Optimization. (i) A time-pose function $\phi$ is trained to predict camera poses from timestamps; (ii) The neural radiance field parameterized by $\theta$ is bootstrapped with pure RGB losses; (iii) Both of the parameters $\theta$, $\phi$ are jointly optimized with RGB-D supervision.
  • Figure 5: We propose a photo-realistically rendered dataset named Asynchronous Urban Scene (AUS) for evaluation. (a/b) are large-scale city scenes designed according to New York and San Francisco, while (c) shows (relatively) small-scale scenes provided by UrbanScene3D. Drone trajectories of different difficulty levels are visualized in (a-c). On these trajectories, we first capture an RGB-D sequence at a sufficiently high frame rate. Then we apply two resampling strategies: fixed offset (d) and random offset (e). $x$ equals $30$ in (d) for every RGB-D pair. $x$ equals $30$ while $y$ equals $50$ in (e).
  • ...and 1 more figure
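
The fixed-offset and random-offset resampling strategies named in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual generation script. The caption does not say what $x$ and $y$ measure, so the sketch expresses offsets as counts of frames in the high-frame-rate sequence. The function names, the `stride` parameter, and the uniform sampling of the random offset are all hypothetical.

```python
import random


def resample_fixed_offset(timestamps, stride, offset):
    """Pair every `stride`-th RGB frame with the depth frame `offset` frames later.

    `timestamps` is the dense, synchronized high-frame-rate sequence.
    Returns a list of (rgb_timestamp, depth_timestamp) tuples.
    """
    pairs = []
    for i in range(0, len(timestamps) - offset, stride):
        pairs.append((timestamps[i], timestamps[i + offset]))
    return pairs


def resample_random_offset(timestamps, stride, max_offset, seed=0):
    """Same pairing, but each depth frame lags by a random number of frames.

    Assumption: the lag is drawn uniformly from 1..max_offset per pair.
    """
    rng = random.Random(seed)
    pairs = []
    for i in range(0, len(timestamps) - max_offset, stride):
        lag = rng.randint(1, max_offset)
        pairs.append((timestamps[i], timestamps[i + lag]))
    return pairs
```

With a fixed offset, every RGB-D pair has the same time gap. With a random offset, the gap varies from pair to pair, so it cannot be removed by a single global time shift.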