Aug3D: Augmenting large scale outdoor datasets for Generalizable Novel View Synthesis

Aditya Rauniyar; Omar Alama; Silong Yong; Katia Sycara; Sebastian Scherer

TL;DR

This work tackles generalizable novel view synthesis (GNVS) in large-scale outdoor environments by proposing Aug3D, a reconstruction-based augmentation framework. Aug3D pairs SfM-based view clustering, which keeps view overlap high within each training cluster, with multiscale grid sampling and semantic building sampling, which generate diverse, well-conditioned novel views. The generated views augment real outdoor data used to train feed-forward GNVS models such as PixelNeRF. Key findings: clustering by shared SfM points yields the best training overlap, smaller clusters further improve PSNR, and semantic augmentation provides notable gains when combined with real data. The approach enables more robust learning of urban scene priors for GNVS, with potential impact on outdoor AR/VR, city-scale mapping, and view-synthesis systems that must generalize across diverse environments.
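To make the idea of clustering by shared SfM points concrete, the following is a minimal sketch, not the authors' implementation. It assumes each image comes with the set of 3D point IDs it observes in an SfM reconstruction. The function name `cluster_by_shared_points`, the greedy seeding rule, and the toy data are all illustrative assumptions.

```python
from typing import Dict, List, Set


def cluster_by_shared_points(
    observations: Dict[str, Set[int]], cluster_size: int
) -> List[List[str]]:
    """Greedily group images that observe many of the same SfM 3D points.

    `observations` maps an image name to the IDs of the 3D points it sees.
    This greedy rule is an assumption made for illustration only.
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    remaining = set(observations)
    clusters: List[List[str]] = []
    while remaining:
        # Seed with the unassigned image that sees the most points;
        # iterating in sorted order makes tie-breaking deterministic.
        seed = max(sorted(remaining), key=lambda name: len(observations[name]))
        remaining.discard(seed)
        # Rank the other unassigned images by points shared with the seed.
        ranked = sorted(
            remaining,
            key=lambda name: (-len(observations[seed] & observations[name]), name),
        )
        members = [seed] + ranked[: cluster_size - 1]
        remaining.difference_update(members)
        clusters.append(members)
    return clusters


# Toy example: images "a"/"b" and "c"/"d" look at two different structures.
toy = {"a": {1, 2, 3, 4}, "b": {1, 2, 3}, "c": {10, 11}, "d": {10, 11, 12}}
print(cluster_by_shared_points(toy, cluster_size=2))
```

Lowering `cluster_size` in this sketch keeps only the images that share the most points with the seed, which matches the intuition that smaller clusters have higher overlap.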

Abstract

Recent advances in photorealistic Novel View Synthesis (NVS) have attracted growing attention, but most of these approaches remain constrained to small indoor scenes. While optimization-based NVS models have attempted to address this limitation, generalizable feed-forward methods, which offer significant advantages, remain underexplored. In this work, we train PixelNeRF, a feed-forward NVS model, on the large-scale UrbanScene3D dataset. We propose four strategies for clustering this dataset into training scenes and show that performance is hindered by limited view overlap. To address this, we introduce Aug3D, an augmentation technique that leverages scenes reconstructed with traditional Structure-from-Motion (SfM). Aug3D generates well-conditioned novel views through grid and semantic sampling to enhance feed-forward NVS model learning. Our experiments reveal that reducing the number of views per cluster from 20 to 10 improves PSNR by 10%, but performance remains suboptimal. Aug3D further addresses this by combining the newly generated novel views with the original dataset, demonstrating its effectiveness in improving the model's ability to predict novel views.
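Since the abstract reports results in PSNR, a short sketch of the standard definition may help: PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between predicted and ground-truth pixel values. The function name `psnr`, the flat-list image representation, and the pixel range of [0, 1] are assumptions for illustration; the paper's exact evaluation code is not shown in this summary.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical two-pixel "images", used only to exercise the function.
print(round(psnr([0.5, 0.5], [0.5, 0.6]), 2))
```

Because PSNR is logarithmic, a relative gain in dB corresponds to a larger-than-proportional reduction in MSE.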

Paper Structure (14 sections, 7 figures, 2 tables)

Introduction
Related Work
Approach
Data curation for Generalizable NVS
Capture Sequence grouping
Grid-Based Grouping
Ray intersection with ground plane
SfM shared points
Augmentation
Multiscale Grid Sampling
Semantic Building Sampling
Experimental Setup
Results
Discussion

Figures (7)

Figure 1: Scene clustering methods for training GNVS models. Cameras drawn in the same color belong to the same cluster. Panels (a), (b), and (c) show edge cases in which these methods would assign the wrong images to a scene.
Figure 2: Two types of augmentation to reduce low overlap among outdoor scene datasets: (a) Multiscale Grid Sampling and (b) Semantic Sampling. The left figure shows dynamic camera placements for varying grid scales, and the right figure illustrates focused sampling around urban regions.
Figure 3: Qualitative comparison of clustering methods for aerial image grouping. Each column represents a method: (a) Camera Sequence groups images with overlap in scenes 1 and 2, but misses 3 and 4. (b) Grid-Based grouping overlaps scenes 1 and 3, missing others. (c) Ray Intersection captures overlap in scenes 1, 2, and partly 3, but not 4. (d) SfM Shared grouping achieves high overlap across all scenes, demonstrating superior performance.
Figure 4: Qualitative comparison of models trained with 10 images per cluster versus 20 images per cluster using the SfM shared points method on a campus scene. Reducing the cluster size from 20 to 10 demonstrates marginally improved visual quality. The first row represents fine-grained predictions, while the second row shows coarse-grained predictions. Columns depict Input Views, Ground Truth, Depth, and Predictions.
Figure 5: Qualitative comparison of PixelNeRF trained exclusively on synthetic datasets generated using grid sampling versus semantic sampling methods on the UrbanScene3D Campus scene. The first row represents fine-grained predictions, while the second row shows coarse-grained predictions.
...and 2 more figures
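As a rough illustration of the multiscale grid sampling described in the Figure 2 caption, the sketch below places one virtual camera at the center of each cell of several grids laid over the scene footprint. It is not the authors' procedure: the function name `multiscale_grid_cameras`, the square grids, and the rule that camera altitude scales with cell width (`height_per_cell_width`) are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Camera = Tuple[float, float, float]  # (x, y, altitude)


def multiscale_grid_cameras(
    bounds: Tuple[float, float, float, float],
    cells_per_side: Sequence[int],
    height_per_cell_width: float = 1.0,
) -> List[Camera]:
    """Place one camera above the center of every cell at every grid scale.

    `bounds` is (x_min, x_max, y_min, y_max) of the scene footprint.
    Coarser grids get higher cameras (an assumption of this sketch).
    """
    x_min, x_max, y_min, y_max = bounds
    cameras: List[Camera] = []
    for n in cells_per_side:
        if n < 1:
            raise ValueError("each grid needs at least one cell per side")
        cell_w = (x_max - x_min) / n
        cell_h = (y_max - y_min) / n
        altitude = height_per_cell_width * max(cell_w, cell_h)
        for i in range(n):
            for j in range(n):
                cameras.append(
                    (x_min + (i + 0.5) * cell_w, y_min + (j + 0.5) * cell_h, altitude)
                )
    return cameras


# A 100 x 100 footprint sampled with a 1x1 grid and a 2x2 grid.
for cam in multiscale_grid_cameras((0.0, 100.0, 0.0, 100.0), [1, 2]):
    print(cam)
```

In a full pipeline, each sampled pose would be rendered from the SfM-based reconstruction to produce an augmented training view; that rendering step is outside the scope of this sketch.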
