Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in a Driving Scene

Tai-Yu Pan, Sooyoung Jeon, Mengdi Fan, Jinsu Yoo, Zhenyang Feng, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao

TL;DR

The paper tackles occlusion and sensing limits in ego-centric autonomous driving by proposing Transfer Your Perspective (TYP), a conditional diffusion framework that synthesizes reference-view LiDAR conditioned on ego data. Training has two stages: the first learns $P(x|y)$ from real ego-centric data with semantic conditioning; the second grounds generation to a reference viewpoint with an adapter to realize $P(x_r|x_e,y_r)$, aided by domain adaptation to bridge the simulated and real domains. The authors demonstrate that generated reference data can substitute for real collaborative data, enabling scalable CAV development through datasets like ColWaymo and effective pre-training of collaborative perception backbones across real and semi-synthetic domains. The approach promises substantial reductions in data collection effort while expanding the scope and robustness of multi-agent perception systems, with demonstrated gains in both synthetic and real-world contexts.
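
As a reading aid, the two training stages can be written as two conditional models. Only $P(x|y)$ and $P(x_r|x_e,y_r)$ come from the summary above; the rest of the notation is an assumption made here for illustration: $x$ is a LiDAR point cloud, $y$ is the semantic condition, subscripts $e$ and $r$ denote the ego and reference views, $\theta$ stands for the diffusion model's parameters, and $\phi$ stands for the adapter introduced in the second stage.

$$
\text{Stage 1 (real ego-centric data):}\qquad p_{\theta}(x \mid y) \;\approx\; P(x \mid y)
$$

$$
\text{Stage 2 (simulated collaborative data):}\qquad p_{\theta,\phi}(x_r \mid x_e,\, y_r) \;\approx\; P(x_r \mid x_e,\, y_r)
$$

Read this way, stage 1 learns what a realistic point cloud looks like given a semantic condition, and stage 2 adds the ego observation $x_e$ as a further condition so that the generated reference view agrees with what the ego-car actually saw.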

Abstract

Self-driving cars relying solely on ego-centric perception face limitations in sensing, often failing to detect occluded, faraway objects. Collaborative autonomous driving (CAV) seems like a promising direction, but collecting data for development is non-trivial. It requires placing multiple sensor-equipped agents in a real-world driving scene, simultaneously! As such, existing datasets are limited in locations and agents. We introduce a novel surrogate to the rescue, which is to generate realistic perception from different viewpoints in a driving scene, conditioned on a real-world sample - the ego-car's sensory data. This surrogate has huge potential: it could potentially turn any ego-car dataset into a collaborative driving one to scale up the development of CAV. We present the very first solution, using a combination of simulated collaborative data and real ego-car data. Our method, Transfer Your Perspective (TYP), learns a conditioned diffusion model whose output samples are not only realistic but also consistent in both semantics and layouts with the given ego-car data. Empirical results demonstrate TYP's effectiveness in aiding in a CAV setting. In particular, TYP enables us to (pre-)train collaborative perception algorithms like early and late fusion with little or no real-world collaborative data, greatly facilitating downstream CAV applications.
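
To make "early and late fusion" concrete, here is a minimal sketch in plain Python; it is not the authors' implementation. It assumes a hypothetical 2D pose `(tx, ty, yaw)` of the reference agent expressed in the ego frame, points as `(x, y, z)` tuples, and detections as `(x, y, z, score)` tuples. The distance-based merge in `late_fusion` is a simplified stand-in for the box-level non-maximum suppression a real detector would use.

```python
import math


def to_ego_frame(points, ref_pose):
    """Map (x, y, z) points from the reference agent's frame into the ego frame.

    ref_pose = (tx, ty, yaw) is a hypothetical planar pose of the reference
    agent in the ego frame (yaw in radians); z is left unchanged.
    """
    tx, ty, yaw = ref_pose
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x - s * y + tx, s * x + c * y + ty, z) for x, y, z in points]


def early_fusion(ego_points, ref_points, ref_pose):
    """Early fusion: merge raw point clouds in one frame before detection."""
    return list(ego_points) + to_ego_frame(ref_points, ref_pose)


def late_fusion(ego_dets, ref_dets, ref_pose, radius=1.0):
    """Late fusion: merge per-agent detections (x, y, z, score) in the ego frame.

    Greedy merge by descending score: a detection is dropped when a
    higher-scoring one was already kept within `radius` metres (in x-y).
    """
    moved = [
        to_ego_frame([(x, y, z)], ref_pose)[0] + (score,)
        for x, y, z, score in ref_dets
    ]
    kept = []
    for det in sorted(list(ego_dets) + moved, key=lambda d: d[3], reverse=True):
        if all(math.dist(det[:2], k[:2]) > radius for k in kept):
            kept.append(det)
    return kept


# Reference agent sits 10 m ahead of the ego-car, rotated by 90 degrees.
pose = (10.0, 0.0, math.pi / 2)
fused_points = early_fusion([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.5)], pose)
fused_dets = late_fusion(
    [(10.0, 1.0, 0.0, 0.6)],
    [(1.0, 0.0, 0.0, 0.9), (5.0, 0.0, 0.0, 0.8)],
    pose,
)
```

In this setting, the reference-view inputs (`ref_points` or `ref_dets`) could come either from a real second agent or from a generated viewpoint, which is the substitution the abstract describes.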

Paper Structure

This paper contains 22 sections, 9 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Illustration of the proposed problem and solution, Transfer Your Perspective (TYP). (a) Given sensory data captured by the ego-car (red triangle). (b) Sensory data generated by TYP, as seen from the viewpoint of another vehicle (green triangle) in the same scene. (c) Generated sensory data, as seen from an imaginary static agent such as a roadside unit (blue icon). (d) Putting all the sensory data together, given or generated, TYP enables the development of collaborative perception with little or no real collaborative driving data.
  • Figure 2: Illustration of TYP's conditioned generative model and training process. We propose a two-stage training procedure. The first stage maximizes the generation capability by conditioning solely on object locations (using real single-agent target data), while the second stage grounds the generation on the ego-car's perspective to match semantics and layouts (using simulated CAV data). Additionally, we introduce a discriminator to adapt simulated CAV features to the real target domain, making the trained model readily applicable to the target domain after the second stage.
  • Figure 3: Qualitative results on enhancement in the target domain. With the enhancement, the generated point cloud (green) has better quality, given the ego point cloud (gray) from Waymo (cf. the section labeled sec:adapt).
  • Figure 4: Qualitative results on Collaborative Waymo. The gray point clouds are from the original single-agent dataset, and the green ones are generated by TYP conditioned on them.
  • Figure 5: Qualitative results. Our proposed TYP is capable of scene editing by inputting the same point cloud with different object locations. From (a) the original point cloud, we (b) remove and (c) add a car. Inspired by the idea of past traversals (You et al., 2022), we apply completely different traffic conditions and generate (d), (e), and (f) to imagine driving through the same intersection.
  • ...and 3 more figures