Extrapolated Urban View Synthesis Benchmark

Xiangyu Han, Zhen Jia, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng, Yiming Li

TL;DR

The paper introduces the Extrapolated Urban View Synthesis (EUVS) benchmark to quantify how well state-of-the-art NVS methods generalize to extrapolated viewpoints in urban driving. Using real-world datasets with multi-traversal, multi-agent, and multi-camera recordings, EUVS defines three evaluation settings (translation-only, rotation-only, translation+rotation) and benchmarks Gaussian Splatting and NeRF-based approaches, revealing substantial generalization gaps and overfitting to training views. Across settings, diffusion priors, depth regularization, and multi-traversal data offer partial gains, but no method fully resolves extrapolation challenges, underscoring the need for more robust representations and large-scale training. The authors also provide a dataset and evaluation protocol to advance photorealistic urban NVS for autonomous driving simulation and robotics, with plans to release data and tools upon acceptance.
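As a rough illustration of the three settings, the sketch below measures how far a test camera sits from its nearest training camera, in translation and in rotation, and assigns a label. It is not the benchmark's protocol: the pose format (a 3x3 rotation as nested lists plus a camera center), the function names, and both thresholds are hypothetical choices made for this example.

```python
import math


def rotation_angle_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise inner product of the two matrices.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_theta))


def pose_offset(test_pose, train_poses):
    """Return (translation distance, rotation angle) to the nearest training pose.

    Each pose is a (rotation, center) pair; "nearest" means the smallest
    camera-center distance. This pose format is an assumption of the sketch.
    """
    R_test, c_test = test_pose
    R_near, c_near = min(train_poses, key=lambda p: math.dist(p[1], c_test))
    return math.dist(c_near, c_test), rotation_angle_deg(R_near, R_test)


def label_setting(trans, rot_deg, trans_thresh=1.0, rot_thresh_deg=15.0):
    """Map an offset to a setting name. Both thresholds are hypothetical."""
    moved = trans > trans_thresh
    turned = rot_deg > rot_thresh_deg
    if moved and turned:
        return "translation+rotation"
    if moved:
        return "translation-only"
    if turned:
        return "rotation-only"
    return "near-training"


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
ROT_Z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degree turn about the z-axis
TRAIN = [(IDENTITY, (0.0, 0.0, 0.0)), (IDENTITY, (5.0, 0.0, 0.0))]

print(label_setting(*pose_offset((IDENTITY, (5.0, 3.0, 0.0)), TRAIN)))
print(label_setting(*pose_offset((ROT_Z_90, (0.0, 0.0, 0.0)), TRAIN)))
```

A test view that lands close to a training pose on both axes falls into the "near-training" bucket, which corresponds to the interpolated setup that the benchmark contrasts against.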

Abstract

Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views deviate substantially from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. We also conduct both quantitative and qualitative evaluations of state-of-the-art NVS methods across different evaluation settings. Our results show that current NVS methods are prone to overfitting to training views. Moreover, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We will release the data to help advance self-driving and urban robotics simulation technology.

Paper Structure

This paper contains 11 sections, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Our key contributions. Previous evaluations for urban view synthesis have primarily focused on interpolated poses, as the lack of ground truth data has made it challenging to evaluate extrapolated poses. We address this gap by providing real-world data that enables both quantitative and qualitative evaluations of extrapolated view synthesis in urban scenes. The quantitative results reveal a significant performance drop in 3D Gaussian Splatting (Kerbl et al.) when handling extrapolated views, highlighting the need for more robust NVS methods.
  • Figure 2: Dataset visualization. Our dataset features diverse scenes across various locations in different cities, sourced from multiple datasets. Typical driving scenarios include maneuvers such as lane changes, cross intersections, and T-junctions. Top: Each column displays images captured at the same location by different agents or traversals. Bottom: Each image displays the COLMAP points at a specific location, along with the corresponding camera poses.
  • Figure 3: Dataset distribution. Our dataset comprises 90,810 frames distributed over 104 cases, capturing a diverse array of multi-traversal paths, multi-agent interactions, and multi-camera perspectives across varying evaluation settings.
  • Figure 4: Qualitative and quantitative results across three evaluation settings. The performance drop from interpolation to extrapolation is significant in both the qualitative and quantitative comparisons. Each test setting has distinct scenario characteristics, so together they probe different capabilities of a reconstruction method, including geometric accuracy, hallucination ability, view consistency, and depth precision.
  • Figure 5: Qualitative comparison of extrapolated view synthesis across different settings. For each setting, results from different methods are compared against the ground truth. Red boxes highlight areas where methods are limited in capturing fine details, such as road surfaces, sky regions, or object boundaries, demonstrating the challenges faced by each approach under varying movement complexities.
  • ...and 8 more figures
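
The interpolation-to-extrapolation drop described in the Figure 1 and Figure 4 captions can be expressed as a difference in mean image quality between the two test sets. The sketch below uses PSNR as an illustrative metric; this section does not say which metrics the benchmark reports, and the function names and the flat-list image format (pixel values in [0, 1]) are assumptions of the example.

```python
import math


def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr(pairs):
    """Average PSNR over (rendered, ground_truth) pairs."""
    return sum(psnr(pred, gt) for pred, gt in pairs) / len(pairs)


def extrapolation_gap(interp_pairs, extrap_pairs):
    """Mean PSNR on interpolated views minus mean PSNR on extrapolated views."""
    return mean_psnr(interp_pairs) - mean_psnr(extrap_pairs)


# Toy data: a black ground-truth image and renders with a uniform error.
gt = [0.0] * 16
interp = [([0.01] * 16, gt)]  # small error on a view near the training poses
extrap = [([0.10] * 16, gt)]  # larger error on a view far from the training poses
print(f"gap: {extrapolation_gap(interp, extrap):.1f} dB")
```

A positive gap means renders degrade on the extrapolated views; a method that generalizes well would keep this number close to zero.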