Radiance Field Learners As UAV First-Person Viewers

Liqi Yan; Qifan Wang; Junhan Zhao; Qiang Guan; Zheng Tang; Jianhui Zhang; Dongfang Liu

TL;DR

FPV-NeRF synthesizes first-person-view (FPV) frames from UAV videos using three components: a multi-scale camera space estimation framework, a global-local scene encoder with cross-resolution attention, and a three-term loss that jointly enforces temporal consistency, global structural integrity, and local detail. The approach subdivides the airspace into regions with region-specific warp functions, uses a hash-encoded feature pool and volume features, and fuses information across scales with the cross-resolution attention mechanism, which improves reconstructions across outdoor-to-indoor transitions. On a new UAV dataset with diverse trajectories, FPV-NeRF achieves substantial PSNR/SSIM gains over state-of-the-art NeRF methods while keeping per-view rendering times feasible, and ablations confirm the importance of each component and loss term. The work advances practical UAV spatial perception and enables offline training for navigation and perception tasks such as object detection.

Abstract

First-Person-View (FPV) holds immense potential for revolutionizing the trajectory of Unmanned Aerial Vehicles (UAVs), offering an exhilarating avenue for navigating complex building structures. Yet, traditional Neural Radiance Field (NeRF) methods face challenges such as sampling single points per iteration and requiring an extensive array of views for supervision. UAV videos exacerbate these issues with limited viewpoints and significant spatial scale variations, resulting in inadequate detail rendering across diverse scales. In response, we introduce FPV-NeRF, addressing these challenges through three key facets: (1) Temporal consistency. Leveraging spatio-temporal continuity ensures seamless coherence between frames; (2) Global structure. Incorporating various global features during point sampling preserves space integrity; (3) Local granularity. Employing a comprehensive framework and multi-resolution supervision for multi-scale scene feature representation tackles the intricacies of UAV video spatial scales. Additionally, due to the scarcity of publicly available FPV videos, we introduce an innovative view synthesis method using NeRF to generate FPV perspectives from UAV footage, enhancing spatial perception for drones. Our novel dataset spans diverse UAV trajectories, from outdoor to indoor environments, and differs significantly from traditional NeRF scenarios. Through extensive experiments encompassing both interior and exterior building structures, FPV-NeRF demonstrates a superior understanding of the UAV flying space, outperforming state-of-the-art methods on our curated UAV dataset. Explore our project page for further insights: https://fpv-nerf.github.io/.

Paper Structure (13 sections, 10 equations, 8 figures, 6 tables)

This paper contains 13 sections, 10 equations, 8 figures, 6 tables.

Introduction
Related Works
Methods
Overview
Multi-Scale Camera Space Estimation
Global-Local Scene Encoder
Comprehensive Learning Objective
Experiments
UAV Dataset Collection
Experimental Settings
Comparison with SOTA
Ablation Study
Conclusion

Figures (8)

Figure 1: Comparison of our proposed FPV-NeRF and previous NeRF-based methods. Previous NeRF methods can be divided into two types: forward-facing and 360° object-centric. In UAV videos, view synthesis faces the following challenges: 1) Degree of view restriction, as UAV perspectives are limited by drone trajectories; and 2) Scene change, as UAVs encounter significant changes in scene scale and lighting conditions when transitioning from outdoors to indoors.
Figure 2: The overall framework of our method. After estimating the camera location and pose space using various warping functions, we can sample a pixel in a frame as a view ray, which consists of a sequence of point positions in this estimated space. During training, we use these point positions to query their learnable features from a feature pool. Then, we pass those point positions and corresponding features through a global-local encoder and rendering decoder to obtain the predicted color for this pixel. Comparing this predicted pixel color with the ground-truth color of the pixel supervises the network. During testing, we input a novel auto-generated point position sequence into this pipeline and obtain a novel first-person-view video (see Fig. 3).
Figure 3: Visualization of UAV trajectory and auto-generated FPV video trajectories with camera poses. (a-b) Tentage scene. (c-d) Market scene.
Figure 4: Illustration of disparity alignment. The absence of local granularity results in a substantial difference in reciprocal depth disparity between two resolutions (see the Comprehensive Learning Objective section).
Figure 5: Qualitative comparison with several SOTA methods. The frames synthesized by DVGO are blurred because its limited resolution cannot represent such a long trajectory. The results of Mip-NeRF-360 and F$^2$-NeRF show local noise and distortion due to their unbalanced scene space organization. In comparison, our FPV-NeRF takes advantage of adaptive space subdivision and considers scene features at different scales to fully exploit the global-local representation capacity.
...and 3 more figures
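
The per-pixel supervision described in the Figure 2 caption (sample points along a view ray, predict a color, compare it with the ground truth) can be illustrated with the standard NeRF volume-rendering quadrature. The sketch below is generic and is not FPV-NeRF's implementation: the function names `composite_ray` and `photometric_loss` are hypothetical, and the densities and colors are passed in directly rather than produced by the paper's feature pool, global-local encoder, and rendering decoder.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (standard NeRF quadrature).

    densities: non-negative volume density per sample
    colors:    (r, g, b) tuple per sample, each channel in [0, 1]
    deltas:    distance between consecutive samples
    Returns the rendered pixel color and the transmittance left after
    the last sample. All inputs are assumed to be plain Python lists.
    """
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


def photometric_loss(predicted, target):
    """Mean squared error between a rendered pixel and its ground truth."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical ray with three samples: empty space, then a red surface.
pixel, remaining = composite_ray(
    densities=[0.0, 5.0, 50.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    deltas=[0.5, 0.5, 0.5],
)
loss = photometric_loss(pixel, (1.0, 0.0, 0.0))
```

In this toy ray the first sample has zero density and contributes nothing, so the rendered pixel is dominated by the red samples behind it. The loss here is only the basic color term; the temporal, global, and local terms summarized in the TL;DR are not reproduced.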
