Dynamic Neural Radiance Field From Defocused Monocular Video

Xianrui Luo, Huiqiang Sun, Juewen Peng, Zhiguo Cao

TL;DR

Defocus blur in monocular dynamic videos hinders accurate dynamic NeRF reconstruction. $D^{2}RF$ integrates layered Depth-of-Field volume rendering into dynamic NeRF training, converting layer-based blur into ray-based kernels and using a sparse kernel for efficiency, while blending static and dynamic components and enforcing cross-time consistency. The approach is evaluated on a defocused dynamic dataset synthesized from VDW, showing significant gains over state-of-the-art baselines and deblurring pipelines in both perceptual and photometric metrics. This work enables sharp, temporally coherent novel views from realistically blurry monocular videos, with implications for AR/VR and video editing workflows.

Abstract

Dynamic Neural Radiance Field (NeRF) from monocular videos has recently been explored for space-time novel view synthesis and achieved excellent results. However, defocus blur caused by depth variation often occurs in video capture, compromising the quality of dynamic reconstruction because the lack of sharp details interferes with modeling temporal consistency between input views. To tackle this issue, we propose $D^{2}RF$, the first dynamic NeRF method designed to restore sharp novel views from defocused monocular videos. We introduce layered Depth-of-Field (DoF) volume rendering to model the defocus blur and reconstruct a sharp NeRF supervised by defocused views. The blur model is inspired by the connection between DoF rendering and volume rendering. The opacity in volume rendering aligns with the layer visibility in DoF rendering. To execute the blurring, we modify the layered blur kernel to the ray-based kernel and employ an optimized sparse kernel to gather the input rays efficiently and render the optimized rays with our layered DoF volume rendering. We synthesize a dataset with defocused dynamic scenes for our task, and extensive experiments on our dataset show that our method outperforms existing approaches in synthesizing all-in-focus novel views from defocus blur while maintaining spatial-temporal consistency in the scene.
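
The two ingredients named in the abstract, volume rendering and a sparse blur kernel over rays, can be illustrated with a minimal sketch. The first function is the standard NeRF alpha-compositing rule; the second mixes the colors of a few rays with normalized weights. The function names, the per-ray (rather than per-layer) mixing, and the weight normalization are assumptions made for illustration. This simplification composites each ray first and blurs afterwards, so it is not the layered DoF volume rendering that the paper proposes.

```python
import math


def composite_ray(sigmas, colors, deltas):
    """Standard NeRF alpha compositing along a single ray.

    sigmas: densities per sample, colors: [r, g, b] per sample,
    deltas: distance between consecutive samples.
    Returns the rendered color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for ch in range(3):
            rgb[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return rgb, weights


def blur_with_sparse_kernel(ray_colors, kernel_weights):
    """Hypothetical sparse-kernel blur: a normalized weighted sum of a
    small number of rendered ray colors (an assumption, not the paper's
    layered formulation)."""
    total = sum(kernel_weights)
    rgb = [0.0, 0.0, 0.0]
    for color, w in zip(ray_colors, kernel_weights):
        for ch in range(3):
            rgb[ch] += (w / total) * color[ch]
    return rgb


# One ray with an empty first sample and an opaque green second sample.
sharp_rgb, sample_weights = composite_ray(
    sigmas=[0.0, 1e9],
    colors=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    deltas=[1.0, 1.0],
)

# Two neighboring rays mixed with weights 1 and 3.
blurred_rgb = blur_with_sparse_kernel(
    ray_colors=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    kernel_weights=[1.0, 3.0],
)
```

In this sketch the per-sample weights play the role that the abstract attributes to opacity, which is the quantity it relates to layer visibility in DoF rendering.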

Paper Structure (22 sections, 18 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Given a monocular video captured with defocus blur, existing dynamic NeRF approaches fail to recover high-quality details and tend to produce blurry views, whereas our method $D^{2}RF$ synthesizes sharp novel views.
  • Figure 2: The defocus blur formation. We model the defocus blur of the center purple pixel. The blue and green objects are from different layers. The light path is drawn as the left-right view, and the image plane shows the front-back view of the defocus blur. The CoC of the purple point gathers its neighboring pixels on the green foreground and those on the blue background. This defocus modeling indicates that a region originally occluded by the green object under the pinhole camera view (red square on the blue object) can contribute to the rendered purple point through the green and blue semi-circles. In Eq.\ref{eq:bokeh}-\ref{eq:volume_2} we model the defocus blur from layer visibility $W_i$, then use the link between visibility and opacity to integrate DoF and volume rendering in Eq.\ref{eq:bokeh_volume}. The visibility is learned from the layered DoF volume rendering pipeline.
  • Figure 3: Pipeline of our framework. The framework takes as inputs a set of plane coordinates $(u,v)$, the time embedding $t_i$, and a predefined kernel template; the outputs are the blur kernels $K(\boldsymbol{r_i})$, consisting of the sparse optimized rays with their corresponding weights. The rays are then fed to the two MLPs $G_{\theta}^{\textrm{st}}$ and $G_{\theta}^{\textrm{dy}}$, which independently represent the static and dynamic scenes. The final color is rendered by layered DoF volume rendering (Section \ref{sec:method_3}): $\hat{C}_{dof}(\boldsymbol{r})$ is from Eq.\ref{eq:bokeh_volume} and $\hat{C}_{dof}^{t}(\boldsymbol{r})$ is from Eq.\ref{eq:blend}. The rendered defocused results (dynamic and blended) are supervised by the input defocused views. For testing, we directly render the rays without layered DoF volume rendering and the kernel.
  • Figure 4: The qualitative results with all dynamic NeRF baselines. Compared with existing dynamic NeRF methods, our method generates sharper novel views that are more faithful and have more details. The scenes are Mountain, Shop, Car, Dock.
  • Figure 5: The qualitative results with dynamic NeRF and their corresponding 2D image deblurring baselines. Although 2D image deblurring helps to alleviate the blur for dynamic NeRF in novel views (red box), our method is more stable and generates more reliable sharp details. The scenes are Camp, Dining1, Dining2, Gate.
  • ...and 6 more figures
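
The circle of confusion (CoC) mentioned in the Figure 2 caption grows with the distance between a scene point and the focal plane, which is why depth variation produces defocus blur. A minimal sketch using the classical thin-lens formula is given below. The function name, the parameter names, and the numeric values are hypothetical, and the paper's own CoC parameterization is not shown in this excerpt, so it may differ.

```python
def circle_of_confusion(focal_length, f_number, focus_distance, object_distance):
    """Thin-lens CoC diameter on the sensor, in the same unit as the inputs.

    Assumes an ideal thin lens with aperture diameter
    focal_length / f_number, focused at focus_distance.
    """
    aperture = focal_length / f_number
    return (
        aperture
        * focal_length
        * abs(object_distance - focus_distance)
        / (object_distance * (focus_distance - focal_length))
    )


# Hypothetical 50 mm f/2 lens focused at 2 m (all lengths in meters).
coc_in_focus = circle_of_confusion(0.05, 2.0, 2.0, 2.0)
coc_at_3m = circle_of_confusion(0.05, 2.0, 2.0, 3.0)
coc_at_4m = circle_of_confusion(0.05, 2.0, 2.0, 4.0)
```

A point on the focal plane has zero CoC, and the CoC increases as the point moves farther behind the focal plane, so objects at different depths receive different amounts of blur.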