DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video

Huiqiang Sun, Xingyi Li, Liao Shen, Xinyi Ye, Ke Xian, Zhiguo Cao

TL;DR

DyBluRF addresses the challenge of generating sharp novel views of dynamic scenes captured in motion-blurred monocular video. It jointly models the camera trajectory across timestamps within each exposure and global 3D object motion via learnable DCT trajectories, with a static branch and cross-time rendering to enforce temporal coherence. The approach uses data-driven priors with Extreme Value Constraints (EVC) and a scene-flow regularization to constrain geometry and motion, and it outperforms existing dynamic NeRF methods on the authors' motion-blurred dynamic dataset. The framework enables sharp, temporally consistent dynamic view synthesis from blurred inputs, with practical implications for AR/VR and 3D reconstruction under real-world capture conditions.

Abstract

Recent advancements in dynamic neural radiance field methods have yielded remarkable outcomes. However, these approaches rely on the assumption of sharp input images. When faced with motion blur, existing dynamic NeRF methods often struggle to generate high-quality novel views. In this paper, we propose DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views from a monocular video affected by motion blur. To account for motion blur in input images, we simultaneously capture the camera trajectory and object Discrete Cosine Transform (DCT) trajectories within the scene. Additionally, we employ a global cross-time rendering approach to ensure consistent temporal coherence across the entire scene. We curate a dataset comprising diverse dynamic scenes that are specifically tailored for our task. Experimental results on our dataset demonstrate that our method outperforms existing approaches in generating sharp novel views from motion-blurred inputs while maintaining spatial-temporal consistency of the scene.
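To make the idea of an object DCT trajectory concrete, here is a minimal sketch in plain Python. It assumes one common parameterization, in which a point's displacement at time t is a weighted sum of cosine (DCT-II style) basis functions; the basis normalization, the number of frequencies, and the function names `dct_basis`, `trajectory_offset`, and `scene_flow` are illustrative assumptions, not the paper's exact formulation. In the method itself the coefficients would be predicted by the dynamic model rather than supplied by hand.

```python
import math


def dct_basis(t, k, num_frames):
    """k-th cosine basis function evaluated at (possibly fractional) time t.

    Assumed normalization: sqrt(2 / T) * cos(pi / (2T) * (2t + 1) * k).
    """
    scale = math.sqrt(2.0 / num_frames)
    return scale * math.cos(math.pi / (2.0 * num_frames) * (2.0 * t + 1.0) * k)


def trajectory_offset(coeffs, t, num_frames):
    """Displacement of a 3D point at time t.

    coeffs: list of K (x, y, z) coefficient triples, one per frequency k = 1..K.
    """
    offset = [0.0, 0.0, 0.0]
    for k, coeff in enumerate(coeffs, start=1):
        basis = dct_basis(t, k, num_frames)
        for axis in range(3):
            offset[axis] += coeff[axis] * basis
    return offset


def scene_flow(coeffs, t_from, t_to, num_frames):
    """Motion of the point between two times, read off the same trajectory."""
    start = trajectory_offset(coeffs, t_from, num_frames)
    end = trajectory_offset(coeffs, t_to, num_frames)
    return [end[axis] - start[axis] for axis in range(3)]


if __name__ == "__main__":
    # Hypothetical coefficients for K = 2 frequencies over a 8-frame clip.
    coeffs = [(1.0, 0.0, -2.0), (0.5, 0.25, 0.0)]
    print(scene_flow(coeffs, t_from=2.0, t_to=3.0, num_frames=8))
```

Because a single set of coefficients describes the whole trajectory, the flow between any two times, including fractional times inside one exposure, comes from the same smooth curve. That property is what makes such a representation a natural fit for motion within the exposure window.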

Paper Structure

This paper contains 22 sections, 26 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Given a monocular video capturing a dynamic scene with motion blur, our proposed method, DyBluRF, effectively synthesizes high-quality and sharp novel views compared to previous dynamic NeRF approaches that often yield low-quality and blurry results.
  • Figure 2: Overall pipeline. To model the motion blur of input images, we discretize the exposure time into $n$ timestamps. Subsequently, we perform ray sampling for the same pixel based on the camera poses for each timestamp. We employ two MLPs to represent dynamic scenes. The dynamic model takes spatial point coordinates $\mathbf{x}$, viewing direction $\mathbf{d}$, and time $t$ as inputs and predicts color $\mathbf{c}_t$, volume density $\sigma_t$, DCT coefficients $\Psi_{\mathbf{x}}^t$, and disocclusion weights $\mathcal{W}_t$. The static model only takes $\mathbf{x}$ and $\mathbf{d}$ as inputs and predicts color $\mathbf{c}$, volume density $\sigma$, and a blending weight $v$ for blending static and dynamic results. After obtaining colors and volume densities for static, dynamic, and blended scenes, we use volume rendering to compute pixel RGB values $\mathbf{C}(\mathbf{r})$. We average the RGB values for the same pixel within the exposure time to obtain the predicted blurry image, and compute losses against the input blurry image. For dynamic and blended images, losses are computed directly against the input image. For the static loss, we use a mask image so that the loss is computed over static regions only.
  • Figure 3: Cross-time rendering. We use cross-time rendering to ensure temporal consistency in dynamic scene representation. We render the target timestamp image $\mathbf{C}_l^i(\mathbf{r})$ in frame $i$ utilizing scene information predicted from the corresponding timestamp in other frames. In addition to selecting adjacent frames $j \in \mathcal{N}(i)$, we also choose an extra frame $q$ from the global time to ensure the global consistency of the DCT trajectory.
  • Figure 4: (a) We use the depth map from MiDaS (Ranftl et al., 2020) as the supervision signal for optimizing the model. Because only blurry images are available as input, the estimated depth is less accurate than it would be for sharp inputs. In the example, MiDaS interprets the blurry edges of the person as foreground, causing the person in the depth map to appear 'fatter' than in the ideal sharp one. (b) Directly optimizing the model with the MiDaS depth estimated from blurry images may yield inaccurate depth maps and distorted novel views. To mitigate this issue, we employ Extreme Value Constraints (EVC) so that the model predicts sharp depth maps and novel views from blurry images.
  • Figure 5: Qualitative comparisons against all baselines. Compared to existing dynamic NeRF methods, our method generates novel view images that are more faithful to the ground truth images, with less blur in both static and dynamic regions.
  • ...and 6 more figures
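
The blur formation described in the Figure 2 caption (render a sharp color for the same pixel at each of the $n$ exposure timestamps, then average) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `volume_render` uses the standard NeRF quadrature, the per-sample colors and densities are supplied by hand instead of being predicted by the two MLPs, and the static/dynamic blending with weight $v$ is omitted because its exact form is not given in this section. The function names are hypothetical.

```python
import math


def volume_render(colors, sigmas, deltas):
    """Composite samples along one ray with the standard NeRF quadrature.

    colors: list of (r, g, b) per sample, ordered near to far
    sigmas: volume density per sample
    deltas: distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for color, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return pixel


def blurry_pixel(sharp_pixels):
    """Average the sharp colors rendered at each exposure timestamp."""
    n = len(sharp_pixels)
    return [sum(p[ch] for p in sharp_pixels) / n for ch in range(3)]


if __name__ == "__main__":
    # Two hypothetical timestamps within one exposure: as the camera and the
    # object move, the same pixel sees a different surface at each timestamp.
    deltas = [0.5, 0.5]
    at_t0 = volume_render([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [50.0, 50.0], deltas)
    at_t1 = volume_render([(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)], [50.0, 50.0], deltas)
    print(blurry_pixel([at_t0, at_t1]))
```

Comparing the averaged color, not any single sharp render, against the blurry input is what lets the underlying scene representation stay sharp while still explaining the observed blur.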