AG-NeRF: Attention-guided Neural Radiance Fields for Multi-height Large-scale Outdoor Scene Rendering

Jingfeng Guo, Xiaohan Zhang, Baozhu Zhao, Qi Liu

TL;DR

This paper tackles NeRF-based rendering of large-scale outdoor scenes captured at multiple altitudes, where existing methods struggle with altitude-induced detail variation and long training times. It introduces AG-NeRF, an end-to-end pipeline that selects source images from different heights and uses an attention-based fusion module to extract and combine the features most relevant to the target view, enabling high-fidelity rendering without a priori height assumptions. On the 56 Leonard and Transamerica benchmarks, AG-NeRF reports state-of-the-art PSNR, and it needs only about half an hour of training on a single RTX 4090 to reach PSNR competitive with multi-stage methods such as BungeeNeRF. The work demonstrates the practical potential of rapid multi-height scene reconstruction for urban-scale VR/AR applications by combining scene priors from diverse altitudes with efficient feature fusion.

Abstract

Existing neural radiance fields (NeRF)-based novel view synthesis methods for large-scale outdoor scenes are mainly built for a single altitude. Moreover, they often require the camera shooting height and scene scope to be known a priori, which makes them inefficient and impractical when the camera altitude changes. In this work, we propose an end-to-end framework, termed AG-NeRF, which seeks to reduce the training cost of building good reconstructions by synthesizing free-viewpoint images from scenes captured at varying altitudes. Specifically, to tackle the detail variation problem from low altitude (drone-level) to high altitude (satellite-level), a source image selection method and an attention-based feature fusion approach are developed to extract and fuse the features most relevant to the target view from multi-height images for high-fidelity rendering. Extensive experiments demonstrate that AG-NeRF achieves SOTA performance on the 56 Leonard and Transamerica benchmarks and requires only half an hour of training to reach a PSNR competitive with the latest BungeeNeRF.
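
The abstract names a source image selection method but does not give its scoring rule. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that similarity to the target view is scored from camera poses alone (normalized position distance plus viewing-direction angle), and that candidates carry a height-level label so that every altitude contributes views. The function name `select_source_views`, the `heights` labels, `k_per_height`, and `angle_weight` are all hypothetical.

```python
import numpy as np

def select_source_views(target_c2w, candidate_c2w, heights,
                        k_per_height=2, angle_weight=1.0):
    """Pick the k most pose-similar candidate cameras at each height level.

    target_c2w:    (4, 4) camera-to-world matrix of the target view.
    candidate_c2w: (N, 4, 4) camera-to-world matrices of candidate views.
    heights:       (N,) integer height-level label per candidate (assumed).
    Returns a list of candidate indices, grouped by ascending height level.
    """
    target_pos = target_c2w[:3, 3]
    # Third rotation column taken as the viewing axis (convention assumed).
    target_dir = target_c2w[:3, 2]
    cand_pos = candidate_c2w[:, :3, 3]
    cand_dir = candidate_c2w[:, :3, 2]

    # Position term: Euclidean distance, normalized to [0, 1].
    dist = np.linalg.norm(cand_pos - target_pos, axis=1)
    dist = dist / (dist.max() + 1e-8)

    # Orientation term: angle between viewing axes, normalized to [0, 1].
    cos = np.clip(cand_dir @ target_dir, -1.0, 1.0)
    angle = np.arccos(cos) / np.pi

    score = dist + angle_weight * angle  # lower means more similar

    selected = []
    for level in np.unique(heights):
        idx = np.flatnonzero(heights == level)
        best = idx[np.argsort(score[idx])[:k_per_height]]
        selected.extend(best.tolist())
    return selected
```

Selecting per height level, rather than taking a global top-k, is a design choice in this sketch: it keeps both drone-level and satellite-level views in the source set even when the target camera sits much closer to one altitude.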

Paper Structure (13 sections, 6 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Performance comparisons on two benchmark datasets. Left: visualization on the Transamerica dataset. The visual results show that the proposed method outperforms the competing methods and reconstructs the bridge completely. Right: PSNR versus training time on the 56 Leonard dataset. The proposed method achieves a 6 to 7 dB PSNR improvement over the other methods. Notably, it requires only half an hour of training on a single RTX 4090 GPU to reach performance competitive with the latest BungeeNeRF (Xiangli et al., 2022), which trains for over five days.
  • Figure 2: Our pipeline. First, according to the camera's extrinsic matrix, we select the source images most similar to the target view from different heights. Next, a trainable U-Net-like network extracts feature maps from these source images. The 3D sample points along the rays are then projected back onto the image planes, and the corresponding feature vectors are obtained by interpolation. Subsequently, these feature vectors interact with each other through an attention-based feature fusion approach and are fed into MLPs along with positional encoding. Finally, the pixel color is computed by volume rendering.
  • Figure 3: Qualitative comparisons on the 56 Leonard dataset.
  • Figure 4: Qualitative comparisons on the Transamerica dataset.
  • Figure 5: Effect of the number of source images. The horizontal axis shows the number of selected source images.
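
The last two stages named in the Figure 2 caption, attention-based feature fusion and volume rendering, can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the paper's module: the fusion here is plain scaled dot-product self-attention across source views with no learned projections, followed by mean pooling, and `fuse_views` and `volume_render` are hypothetical names. The compositing step follows the standard NeRF quadrature rule.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(feats):
    """Self-attention across source views, then mean pooling.

    feats: (P, V, C) features for P sample points seen in V source views.
    Returns (P, C). Queries, keys, and values are the raw features,
    which is a simplification (a trained module would learn projections).
    """
    scale = np.sqrt(feats.shape[-1])
    scores = feats @ feats.transpose(0, 2, 1) / scale  # (P, V, V)
    attn = softmax(scores, axis=-1)
    return (attn @ feats).mean(axis=1)

def volume_render(sigma, rgb, deltas):
    """Composite colors along one ray with the standard NeRF quadrature.

    sigma:  (S,) densities at the S samples.
    rgb:    (S, 3) colors at the samples.
    deltas: (S,) distances between adjacent samples.
    Returns the pixel color (3,) and the per-sample weights (S,).
    """
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights
```

In a full pipeline the fused features would pass through MLPs, together with positional encoding, to predict `sigma` and `rgb` before compositing. That step is omitted here because the caption does not specify the network.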