Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

Chuang Lin, Bingbing Zhuang, Shanlin Sun, Ziyu Jiang, Jianfei Cai, Manmohan Chandraker

TL;DR

This work addresses the gap between synthetic 3D priors used by diffusion-based novel view synthesis (NVS) and real-world vehicle imagery. It introduces Drive-1-to-3, a domain-adaptive finetuning pipeline that (i) maps real camera poses to an object-centric orbital pose, (ii) crops object-centric patches with a fixed focal length to stabilize learning, (iii) performs occlusion-aware latent-space training, and (iv) imposes a left-right symmetric prior to handle large viewpoint changes. The approach yields substantial improvements, including a $68.8\%$ reduction in FID over prior art and strong compatibility with downstream 3D reconstruction (LGM) and object-insertion tasks, while training efficiently on modest hardware. These findings demonstrate that combining rich pretrained diffusion priors with targeted domain adaptations can produce high-fidelity real-vehicle NVS with practical compute, enabling scalable vehicle asset harvesting for autonomous driving applications. The method generalizes across datasets and supports integration into simulation pipelines for safety-aware testing and 3D reconstruction.

Abstract

The recent advent of large-scale 3D data, e.g., Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, the performance of these models drops significantly when applied to real-world images. This paper consolidates a set of good practices to finetune large pretrained models for a real-world task -- harvesting vehicle assets for autonomous driving applications. To this end, we delve into the discrepancies between the synthetic data and real driving data, then develop several strategies to account for them properly. Specifically, we start with a virtual camera rotation of real images to ensure geometric alignment with synthetic data and consistency with the pose manifold defined by pretrained models. We also identify important design choices in object-centric data curation to account for varying object distances in real driving scenes: the model learns across varying object scales with a fixed camera focal length. Further, we perform occlusion-aware training in latent spaces to account for ubiquitous occlusions in real data, and handle large viewpoint changes by leveraging a symmetric prior. Our insights lead to effective finetuning that results in a $68.8\%$ reduction in FID for novel view synthesis over prior art.
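The virtual camera rotation mentioned in the abstract can be illustrated with the standard pure-rotation homography $H = K R K^{-1}$, where $R$ rotates the viewing ray through the object center onto the optical axis. The sketch below is a minimal illustration under that assumption, not the paper's implementation: the function names, the intrinsics `K`, and the object center `center_cam` are all hypothetical.

```python
import numpy as np

def rotation_aligning(a, b):
    """Rotation matrix taking the direction of `a` onto the direction of `b`
    (Rodrigues' formula). Hypothetical helper for illustration only."""
    a = np.asarray(a, dtype=float) / np.linalg.norm(a)
    b = np.asarray(b, dtype=float) / np.linalg.norm(b)
    v = np.cross(a, b)            # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))       # cos(angle)
    s = float(np.linalg.norm(v))  # sin(angle)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)      # directions already aligned
        raise ValueError("opposite directions: rotation axis is ambiguous")
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx * ((1.0 - c) / s**2)

def virtual_rotation_homography(K, center_cam):
    """Homography that re-renders the image as if the camera were rotated
    in place so that its optical axis passes through `center_cam`
    (the object center in camera coordinates, an assumed input)."""
    R = rotation_aligning(center_cam, np.array([0.0, 0.0, 1.0]))
    H = K @ R @ np.linalg.inv(K)
    return H, R

# Hypothetical pinhole intrinsics and object center (metres, camera frame).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
center_cam = np.array([4.0, 1.0, 10.0])

H, R = virtual_rotation_homography(K, center_cam)
p_old = K @ center_cam            # object center in the original image (homogeneous)
p_new = H @ p_old                 # the same point after the virtual rotation
pixel_new = p_new[:2] / p_new[2]  # lands on the principal point (cx, cy)
```

In practice `H` would be passed to an image-warping routine (for example OpenCV's `cv2.warpPerspective`) to resample the image. After the warp the object sits on the optical axis, which is the configuration assumed by object-centric synthetic renderings.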

Paper Structure

This paper contains 18 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Comparisons between the pretrained Free3D [zheng2023free3d] and ours on real vehicle images, demonstrating our large performance gain.
  • Figure 2: The overall pipeline of Drive-1-to-3. First, it processes a single vehicle image from on-board cameras, virtually rotating it to a shared orbital pose. The object-centric image cropped with a constant focal length is fed to a pose-conditioned diffusion model, which performs occlusion-aware training in latent space with a symmetric prior.
  • Figure 3: Illustration of two strategies in object-centric image cropping -- varying object scales vs. varying focal lengths.
  • Figure 4: Qualitative comparison of our method with AutoRF [muller2022autorf], DisCoScene [xu2023discoscene], and Free3D [zheng2023free3d] on real vehicle images.
  • Figure 5: Qualitative results showing benefits of (a) occlusion handling and (b) symmetric prior.
  • ...and 10 more figures
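
The captions for Figures 2 and 5 refer to occlusion-aware training in latent space and to a symmetric prior. The sketch below shows one plausible reading of each idea, and both readings are assumptions rather than the paper's formulation. The first averages a squared error over visible latent cells only. The second mirrors an image left-right and negates its azimuth, assuming azimuth $0°$ lies on the vehicle's symmetry plane. All function names, the downsampling factor, and the array shapes are hypothetical.

```python
import numpy as np

def downsample_mask(mask, factor):
    """Pool a binary visibility mask (H, W) down to latent resolution.
    A latent cell counts as visible only if every pixel it covers is
    visible (a conservative choice, assumed here)."""
    h, w = mask.shape
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by factor")
    pooled = mask.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return (pooled == 1.0).astype(np.float32)

def masked_latent_mse(pred, target, latent_mask):
    """Mean squared error over visible latent cells only.
    pred, target: (C, h, w); latent_mask: (h, w) with 1 = visible."""
    sq = (pred - target) ** 2 * latent_mask[None]
    denom = float(latent_mask.sum()) * pred.shape[0]
    return float(sq.sum()) / max(denom, 1.0)

def mirror_view(image, azimuth_deg):
    """Left-right flip of an (H, W, C) image, paired with the mirrored
    azimuth. Assumes azimuth 0 lies on the object's symmetry plane."""
    return image[:, ::-1].copy(), (-azimuth_deg) % 360.0

# Toy example: an 8x8 mask with one occluded pixel and 2x2 latents (factor 4).
mask = np.ones((8, 8), dtype=np.float32)
mask[0, 0] = 0.0                          # one occluded pixel
latent_mask = downsample_mask(mask, 4)    # top-left latent cell is dropped

pred = np.zeros((2, 2, 2), dtype=np.float32)
target = np.zeros((2, 2, 2), dtype=np.float32)
target[:, 0, 0] = 10.0                    # large error, but only in the occluded cell
loss = masked_latent_mse(pred, target, latent_mask)

image = np.arange(2 * 3 * 1, dtype=np.float32).reshape(2, 3, 1)
flipped, mirrored_azimuth = mirror_view(image, 30.0)
```

With the mask applied, the error confined to the occluded cell does not contribute to the loss, so the model is not penalized for content hidden by an occluder. The mirrored pair gives an extra image-pose sample on the opposite side of the vehicle, which is one way a symmetry assumption can supply supervision for large viewpoint changes.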