VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Yibo Liu, Zheyuan Yang, Guile Wu, Yuan Ren, Kejian Lin, Bingbing Liu, Yang Liu, Jinjun Shan

TL;DR

VQA-Diff addresses the challenge of zero-shot 3D vehicle asset generation from in-the-wild images by bridging a Visual Question Answering (VQA) model with diffusion models. The framework uses VQA to extract rich, real-world vehicle knowledge and converts it into multi-view structural prompts via multi-expert diffusion models, while appearance is rendered with a subject-driven, structure-conditioned diffusion process guided by a ControlNet edge cue and a raw image. The approach avoids requiring large-scale image-to-3D training data and demonstrates superior performance across Pascal 3D+, Waymo, and Objaverse, with ablations confirming the benefits of the multi-expert design. Extensions for diverse asset generation and explicit limitations on generalizing to non-vehicle objects are discussed, highlighting practical impact for autonomous driving simulation, data augmentation, and sim2real research.

Abstract

Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods do not address this problem well because they learn generation merely from image RGB information, without a deeper understanding of in-the-wild vehicles (such as car models and manufacturers). This leaves them with poor zero-shot prediction capability when handling real-world observations with occlusion or tricky viewing angles. To solve this problem, we propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create photorealistic 3D vehicle assets for autonomous driving. VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction, and the rich image prior knowledge in the Diffusion Model for structure and appearance generation. In particular, we use a multi-expert Diffusion Model strategy to generate the structure information and employ a subject-driven, structure-controlled generation mechanism to model appearance information. As a result, without needing to learn from a large-scale image-to-3D vehicle dataset collected from the real world, VQA-Diff still has a robust zero-shot image-to-novel-view generation ability. We conduct experiments on various datasets, including Pascal 3D+, Waymo, and Objaverse, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.
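
The data flow described in the abstract (VQA prompt, then multi-view structure, then appearance) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: every name here (`generate_views`, `NovelView`, the callable types, the default question text) is hypothetical, the models are passed in as plain callables, and the choice of one structure expert per viewpoint is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for the three model stages; none of these are real APIs.
VQAFn = Callable[[str, str], str]              # (image path, question) -> text prompt
StructureExpert = Callable[[str], str]         # text prompt -> structure (edge map) handle
AppearanceFn = Callable[[str, str, str], str]  # (structure, raw image, prompt) -> rendered view handle


@dataclass
class NovelView:
    view: str        # viewpoint name, e.g. "front"
    structure: str   # structure output of the expert for this viewpoint
    rendering: str   # appearance output conditioned on structure and raw image


def generate_views(
    image_path: str,
    vqa: VQAFn,
    experts: Dict[str, StructureExpert],
    render: AppearanceFn,
    question: str = "What is the model, manufacturer, and production year of this vehicle?",
) -> List[NovelView]:
    """Chain the three stages: VQA prompt -> per-view structure -> appearance."""
    # Stage 1: ask the VQA model for a text description of the vehicle.
    prompt = vqa(image_path, question)

    views: List[NovelView] = []
    for view_name, expert in experts.items():
        # Stage 2: each expert turns the prompt into structure for its viewpoint
        # (assumption: one expert per viewpoint).
        structure = expert(prompt)
        # Stage 3: render appearance, conditioned on the structure and the raw image.
        rendering = render(structure, image_path, prompt)
        views.append(NovelView(view_name, structure, rendering))
    return views
```

Passing the models in as callables keeps the sketch independent of any particular VQA or diffusion library; real model wrappers would replace the placeholder functions.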

Paper Structure

This paper contains 14 sections, 22 figures, and 6 tables.

Figures (22)

  • Figure 1: Previous methods learn to generate novel views using image RGB information in a natural space or a latent space, resulting in poor zero-shot prediction capability to handle in-the-wild vehicle observations with occlusion or tricky viewing angles. Our method, VQA-Diff, tackles this problem by exploiting the robust zero-shot prediction ability of the Visual Question Answering (VQA) model and the rich structure and appearance generation ability of Diffusion Models. This helps to create consistent and photorealistic multi-view renderings of any unseen vehicle in the wild.
  • Figure 2: The framework of the proposed VQA-Diff. The VQA model first generates a prompt containing detailed key information regarding the model, manufacturer, production year, and main features of the vehicle. Then, multi-expert DMs adopt the prompt to create multi-view structures of the vehicle. Finally, the subject-driven structure-controlled generation with ControlNet renders the multi-view structures into photorealistic novel views with controllable poses. The photorealistic novel views can be utilized in various downstream tasks, including the creation of 3D assets with the GS/NeRF representation and training data augmentation. It can also be applied in a simulation environment for autonomous driving.
  • Figure 3: A comparison of the processes for dealing with the image-to-novel-view problem of previous methods and the proposed VQA-Diff.
  • Figure 4: An illustration of the question design. We tune the question based on the feedback from Stable Diffusion.
  • Figure 5: An illustration of transferring real-world knowledge and image prior of a pretrained DM into multi-view structure generation.
  • ...and 17 more figures