OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos

Dongyoung Choi, Hyeonjoong Jang, Min H. Kim

TL;DR

OmniLocalRF addresses omnidirectional view synthesis from dynamic 360° videos by reconstructing static-scene radiance fields while removing dynamic objects. It introduces bidirectional optimization across distant frames and a motion-mask module based on multi-resolution feature planes to separate dynamic and static components, enabling high-fidelity inpainting and pose estimation over long trajectories. The approach outperforms state-of-the-art omnidirectional methods in both quantitative metrics and visual quality on real and synthetic datasets, and it estimates camera poses robustly without manual masks or external priors. In practice, this enables photorealistic static-scene views for street-scale applications and AR/VR scenarios where dynamic foregrounds would otherwise degrade novel-view rendering.

Abstract

Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However, they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects, including the photographer, in their wide field of view. In this paper, we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views, removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video, and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions, we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain, our method uses multi-resolution neural feature planes for precise segmentation, which is more suitable for long 360-degree videos. Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics, especially in scenarios with complex real-world scenes. In particular, our approach eliminates the need for manual interaction, such as drawing motion masks by hand and additional pose estimation, making it a highly effective and efficient solution.

Paper Structure

This paper contains 21 sections, 14 equations, 15 figures, and 7 tables.

Figures (15)

  • Figure 1: We introduce omnidirectional local radiance fields for photorealistic view synthesis of static scenery from 360° videos. Our method effectively removes dynamic objects (including the photographer) without manual interaction. Also, it achieves high-resolution details in the inpainted regions by means of bidirectional observations of omnidirectional local radiance fields. Refer to the supplemental video for more results.
  • Figure 2: In a forward-moving perspective video, previously optimized radiance blocks $\mathbf{RF}_{\Theta_p}$ may not be visible in the frames used to train the current radiance fields $\mathbf{RF}_{\Theta_c}$. In omnidirectional video, however, every uncontracted region of the optimized blocks remains visible, enabling effective bidirectional optimization. The boundary marks the uncontracted region on which each radiance field focuses.
  • Figure 3: Our bidirectional optimization for omnidirectional videos. (a) In the forward step, we project the point $\mathbf{P}_c(\mathbf{r}_\text{src})$ rendered by $\mathbf{RF}_{\Theta_c}$ to the destination frame $w(p,j)$, used to train the previous radiance block $\mathbf{RF}_{\Theta_p}$. We then render the color and depth through $\mathbf{RF}_{\Theta_c}$ and $\mathbf{RF}_{\Theta_p}$, respectively, and use the L1 photometric error between the fully rendered color $\hat{\mathbf{C}}_c(\mathbf{r}_\text{dst})$ and the bilinearly interpolated input image $\bar{\mathbf{C}}(\mathbf{r}_\text{dst})$ (Eq. \ref{eq:forward_rgb}) to update $\mathbf{RF}_{\Theta_c}$ and a mask module. (b) In the backward step, we switch the source and destination frames and refine $\mathbf{RF}_{\Theta_p}$ through the valid rays from static areas that meet $\mathcal{R}_\text{P}$.
  • Figure 4: Ablation study on the impact of the backward step. (b) Solely employing the forward step results in a blurred image. (c) Omitting the utilization of Eq. \ref{eq:backward_rgb_src} leads to overfitting on distant frames. (d) Our bidirectional optimization shows great quality in representing details.
  • Figure 5: For motion mask prediction, we cast a ray $\mathbf{r}$ from the $k$-th frame and render the static structure $\hat{\mathbf{C}}^\text{st}(\mathbf{r})$ through volume rendering using radiance fields $\mathbf{RF}_{\Theta}$. We extract multi-resolution features of normalized $(u,v)$ by traversing feature plane set $\mathcal{Z}_k$ and concatenate them into a single code $\mathbf{z}^{k}_{(u,v)}$. We estimate dynamic color $\hat{\mathbf{C}}^\text{dy}(\mathbf{r})$ and motion mask $\hat{M}(\mathbf{r})$, and render the final results $\hat{\mathbf{C}}(\mathbf{r})$ through dynamic compositing (Eq. \ref{eq:dynamic_compositing}). We jointly update the mask module $(\mathcal{Z}_t, F_{\Theta_D})$ with radiance fields $\mathbf{RF}_{\Theta}$ using L1 photometric loss. We supervise $\hat{\mathbf{C}}^\text{dy}(\mathbf{r})$ by $\tilde{\mathbf{C}}(\mathbf{r})$ for unique factorization (Eq. \ref{eq:mask_rgb}) and regularize the alpha of the mask (Eq. \ref{eq:mask_reg}).
  • ...and 10 more figures
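
The captions for Figures 3 and 5 describe three small per-ray operations: projecting a rendered 3D point into another omnidirectional frame, bilinearly sampling the input image there, and blending static and dynamic colors with a predicted mask before taking an L1 photometric error. The following is a minimal sketch of those operations, not the authors' implementation. It assumes equirectangular frames, a camera convention of +z forward, +x right, +y down, and a plain alpha blend for the compositing step; the exact forms of the paper's equations are not reproduced in this summary. All function names are hypothetical.

```python
import math


def equirect_project(point):
    """Map a nonzero 3D point in the camera frame to normalized (u, v) in [0, 1].

    Assumed convention (not taken from the paper): +z forward, +x right,
    +y down, so u grows to the right and v grows downward in the image.
    """
    x, y, z = point
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)        # longitude in [-pi, pi]
    lat = math.asin(y / norm)     # latitude in [-pi/2, pi/2]
    return lon / (2.0 * math.pi) + 0.5, lat / math.pi + 0.5


def bilinear_sample(image, u, v):
    """Bilinearly interpolate an image (list of rows of RGB tuples) at (u, v).

    Pixel centers sit at ((i + 0.5) / W, (j + 0.5) / H). Columns wrap around
    because longitude is periodic; rows are clamped at the poles.
    """
    h, w = len(image), len(image[0])
    px, py = u * w - 0.5, v * h - 0.5
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0

    def texel(xi, yi):
        yi = min(max(yi, 0), h - 1)
        return image[yi][xi % w]

    top = [(1 - fx) * a + fx * b
           for a, b in zip(texel(x0, y0), texel(x0 + 1, y0))]
    bottom = [(1 - fx) * a + fx * b
              for a, b in zip(texel(x0, y0 + 1), texel(x0 + 1, y0 + 1))]
    return [(1 - fy) * t + fy * b for t, b in zip(top, bottom)]


def composite(static_rgb, dynamic_rgb, mask):
    """Blend static and dynamic colors with a mask value in [0, 1].

    Assumed form: mask * dynamic + (1 - mask) * static.
    """
    return [mask * d + (1.0 - mask) * s
            for s, d in zip(static_rgb, dynamic_rgb)]


def l1_photometric(predicted_rgb, target_rgb):
    """Mean absolute difference between two RGB colors."""
    return sum(abs(p - t) for p, t in zip(predicted_rgb, target_rgb)) / len(target_rgb)
```

In a training loop, a point rendered along a source ray would be transformed into the destination camera frame, passed through `equirect_project`, and the resulting `(u, v)` used with `bilinear_sample` to fetch the target color for `l1_photometric`. With a mask near 0, `composite` returns the static color, which is the behavior wanted for rays that hit only static scenery.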