
To View Transform or Not to View Transform: NeRF-based Pre-training Perspective

Hyeonjun Jeong, Juyeb Shin, Dongsuk Kum

Abstract

Neural radiance fields (NeRFs) have emerged as a prominent pre-training paradigm for vision-centric autonomous driving, enhancing 3D geometry and appearance understanding in a fully self-supervised manner. To apply NeRF-based pre-training to 3D perception models, recent approaches have simply applied NeRFs to volumetric features obtained from view transformation. However, coupling NeRFs with view transformation entangles conflicting priors: view transformation imposes discrete and rigid representations, whereas radiance fields assume continuous and adaptive functions. When these opposing assumptions are forced into a single pipeline, the misalignment surfaces as blurry and ambiguous 3D representations that ultimately limit 3D scene understanding. Moreover, the NeRF network used for pre-training is discarded during downstream tasks, so the 3D representations enhanced through NeRF are utilized inefficiently. In this paper, we propose NeRP3D, a novel NeRF-Resembled Point-based 3D Detector that learns continuous 3D representations and thus avoids the misaligned priors of view transformation. NeRP3D preserves the pre-trained NeRF network regardless of the task, inheriting the principle of continuous 3D representation learning and unlocking greater potential for both scene reconstruction and detection. Experiments on the nuScenes dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods on not only the pretext scene reconstruction task but also the downstream detection task.
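
To make the conflict between the two priors concrete, here is a minimal, illustrative sketch (class names, layer sizes, and dimensions are hypothetical, not the paper's architecture): a view-transform-style model stores features on a fixed voxel grid, so every 3D query is an interpolation between rigid cells, whereas a radiance-field-style model is a continuous function that can be evaluated at arbitrary coordinates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the two conflicting priors; all names and sizes
# are hypothetical, not NeRP3D's actual implementation.

class VoxelFeatures(nn.Module):
    """View-transformation prior: features live on a fixed, discrete grid."""
    def __init__(self, grid_size: int = 128, dim: int = 32):
        super().__init__()
        self.grid = nn.Parameter(torch.zeros(1, dim, grid_size, grid_size, grid_size))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) coordinates normalized to [-1, 1].
        # Queries interpolate between rigid cells; resolution is capped by
        # the grid, which is where blur and ambiguity can creep in.
        pts = xyz.view(1, -1, 1, 1, 3)
        feat = F.grid_sample(self.grid, pts, align_corners=True)  # (1, dim, N, 1, 1)
        return feat.view(self.grid.shape[1], -1).t()              # (N, dim)

class ContinuousField(nn.Module):
    """Radiance-field prior: a continuous, adaptive function of position."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # Arbitrary coordinates, no fixed resolution: the representation
        # varies smoothly rather than snapping to voxel centers.
        return self.mlp(xyz)
```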

Paper Structure

This paper contains 32 sections, 6 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Comparison of 2D feature maps (left) and their instance segmentation results (right) using SAM [kirillov2023sam, ren2024grounded, ravi2024sam2] across different methods. All 2D feature maps, except for the ground-truth RGB (row 1) and DINO [caron2021emerging, oquab2023dinov2] features (row 5), are obtained by accumulating 3D point-wise representations along each ray onto the image plane with predicted density (a minimal sketch of this accumulation follows this figure list). They are extracted directly after radiance-field pre-training without any task-specific fine-tuning. UniPAD [yang2024unipad] (row 2) and SelfOcc [huang2024selfocc] (row 3) produce blurry and inaccurate features that fail to separate nearby or crowded objects, resulting in under-segmented instances. In contrast, NeRP3D (row 4) produces precise and well-localized features with distinct object boundaries, comparable to DINO features, without any distillation or fine-tuning from 2D foundation models. Consequently, the enhanced 3D representation is directly reflected in improved instance segmentation quality.
  • Figure 2: Comparison of the previous NeRF-based pre-training methods and our NeRP3D pipeline.
  • Figure 3: Overview of NeRP3D, illustrating both the rendering pre-training (orange) and downstream fine-tuning (blue) pipelines. Through its NeRF-resembled design, our method maintains a coherent 3D understanding from scattered points across diverse tasks while accommodating task-specific point sampling strategies, enabling the model to effectively leverage underlying geometric and appearance information while allowing task-dependent feature specialization.
  • Figure 4: Qualitative comparison of rendered RGB and depth. NeRP3D outperforms state-of-the-art methods on both RGB and depth reconstruction. Our approach maintains high fidelity in urban scenes without blur or pattern artifacts. For depth estimation, NeRP3D distinguishes individual people in crowded areas rather than merging them into indistinct blobs, and precisely captures thin structures such as poles that competing methods often miss or reconstruct as overly thick.
  • Figure 5: Qualitative comparison of 3D object detection results. NeRP3D consistently generates more accurate and reliable 3D bounding boxes. It demonstrates key advantages such as successfully detecting partially occluded objects in dense crowds (top row), reducing false positives for cleaner predictions (middle row), and more accurately localizing small objects such as pedestrians (bottom row).
  • ...and 3 more figures
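
As described in the Figure 1 caption, the rendered feature maps come from compositing 3D point-wise features along each camera ray, weighted by predicted density. Below is a minimal sketch of that standard volume-rendering accumulation; the function name and shapes are illustrative, and this is the generic NeRF compositing rule rather than the paper's exact code.

```python
import torch

def accumulate_along_ray(features: torch.Tensor,   # (S, C) per-sample features
                         densities: torch.Tensor,  # (S,)  predicted densities
                         deltas: torch.Tensor) -> torch.Tensor:  # (S,) sample spacings
    """Standard NeRF-style alpha compositing of per-sample features onto the
    image plane; an illustrative sketch, not NeRP3D's exact implementation."""
    alpha = 1.0 - torch.exp(-densities * deltas)  # opacity of each sample
    # Transmittance T_i: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10], dim=0)[:-1], dim=0
    )
    weights = alpha * trans                              # w_i = T_i * alpha_i
    return (weights.unsqueeze(-1) * features).sum(dim=0)  # (C,) rendered feature
```

The same weighting scheme is conventionally used to render RGB and depth by swapping in per-sample colors or depths for `features`, which is how per-ray 3D representations end up as the 2D maps compared across rows in Figure 1.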