UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen

TL;DR

UniUGP tackles long-tail autonomous driving by unifying three capabilities—understanding, generation, and planning—within a hybrid expert framework that leverages pre-trained VLMs and diffusion-based video generation. It introduces specialized long-tail datasets, a three-expert architecture with a four-stage training regime, and cross-modal losses to align reasoning, trajectories, and future visuals. Empirical results show state-of-the-art performance in perception, reasoning, decision-making, and generation, with strong generalization to challenging scenarios. The work demonstrates the value of integrating linguistic reasoning, visual dynamics, and controllable video synthesis to advance end-to-end autonomous driving.

Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

Paper Structure

This paper contains 25 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Illustration of UniUGP, a unified model with three hybrid experts. The understanding expert performs next-token prediction for causal reasoning. The planning expert forms a MoT architecture with the understanding expert and performs velocity prediction in flow matching to produce future actions. The generation expert is cascaded as a world model to produce future videos.
  • Figure 2: Dataset Construction Pipeline. This figure depicts the pipeline of data collection (integrating multiple challenging driving datasets) and data processing (featuring four task categories: understanding, chain-of-thought, planning, and instruction following) to train and assess the cognitive abilities of end-to-end autonomous driving models within a unified QA framework.
  • Figure 3: The ablation experiment on the absence or presence of world model knowledge. The world model enables the VLA to pay more attention to future causal relationships, thereby focusing on the semantics of distant objects.
  • Figure 4: Trajectory-controllable generation visualization. We control the generation of future video frames by modifying the trajectories fed into the generation model, which demonstrates the controllability of our generation expert.
  • Figure 5: Long-tail perception and understanding of questions and answers.
  • ...and 4 more figures
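
The Figure 1 caption mentions velocity prediction in flow matching for the planning expert. The sketch below illustrates that general idea only and is not taken from the paper: it assumes a straight-line path between a noise sample and a target trajectory, and every name in it (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, `oracle`, and the toy `waypoints`) is hypothetical. A real planner would replace `oracle` with a learned network conditioned on the scene.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t in [0, 1] on the straight line from noise x0 to data x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy target: three (x, y) waypoints flattened into one list (assumed data).
waypoints = [0.0, 0.0, 1.0, 0.1, 2.0, 0.4]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in waypoints]


def oracle(x, t):
    """Stand-in for a trained model: it already knows the target trajectory."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, waypoints)]


sampled = euler_sample(oracle, noise, num_steps=10)
loss = flow_matching_loss(oracle, noise, waypoints, t=0.3)
```

With the oracle velocity field, the Euler steps stay on the straight line and end at the target waypoints, and the loss is zero up to floating-point error. A learned model would only approximate this field, so both the number of integration steps and the time sampling during training become design choices.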