
VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Li-Heng Chen, Ke Cheng, Yahui Liu, Lei Shi, Shi-Sheng Huang, Hongbo Fu

Abstract

Driving video generation has made much progress in controllability, video resolution, and length, but existing methods fail to support fine-grained object-level control over diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present VistaGEN, a new driving video generation technique that enables fine-grained control of specific entities, specified via 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency across long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) that intelligently and automatically assesses the spatiotemporal consistency of the generated content, forming a novel generation-evaluation-regeneration closed-loop mechanism. This mechanism ensures high-quality, coherent outputs, facilitating the creation of complex and reliable driving scenarios. In addition, within the closed loop, we introduce an object-level refinement module that refines results judged unsatisfactory by the MV-VLM and feeds them back to the video generator for regeneration. Extensive evaluation shows that VistaGEN produces diverse driving videos with fine-grained controllability, especially for long-tail objects, and much better spatiotemporal consistency than previous approaches.

Paper Structure

This paper contains 13 sections, 9 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: This paper introduces a new driving video generation technique, VistaGEN, which enables fine-grained control, such as specifying multiple objects' visual-language conditions (roadmaps, text descriptions, and visual appearance in (a) and (c), where the index numbers match the corresponding roadmaps on the right), with high-quality and coherent video generation outputs ((b) and (d)). Moreover, VistaGEN achieves spatiotemporally consistent generation (e), with both multiview consistency (colored red and orange on the left) and long-range temporal consistency (colored pink and cyan across different frames, respectively) in long video sequences. Please refer to our demo video for more details.
  • Figure 2: The pipeline of VistaGEN. Given frame descriptors as input, we hierarchically inject these signals as global and local scene control guidance via visual-language feature fusion for fine-grained controllability. Subsequently, we perform driving video generation via a multiview video generator $\mathcal{D}$. The generation is automatically evaluated by a multiview vision-language evaluator $\mathcal{E}$, followed by an object-level refinement module $\mathcal{R}$. This formulates a closed-loop mechanism to maintain spatiotemporal consistency for long video sequence generation.
  • Figure 3: The illustration of MV-VLM structure.
  • Figure 4: The workflow of our object-level refinement module.
  • Figure 5: Qualitative results of fine-grained video generation by VistaGEN in complex scenes. Our model seamlessly integrates multimodal conditions, including BEV layouts, textual descriptions, and reference images. The colored bounding boxes track specific controlled instances across the generated multi-view videos, demonstrating precise object-level controllability.
  • ...and 8 more figures
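
The closed-loop mechanism described in the abstract and Figure 2 can be sketched as pseudocode: generate a clip, let the evaluator check consistency, refine flagged objects, and regenerate until the output is accepted or a round budget runs out. This is a minimal illustrative sketch only; all names (`generator`, `evaluator`, `refiner`, `max_rounds`, the verdict fields) are hypothetical stand-ins, not the paper's actual API.

```python
def closed_loop_generate(descriptors, generator, evaluator, refiner, max_rounds=3):
    """Generate a multiview video clip, re-generating until the
    evaluator (MV-VLM in the paper) accepts it or the budget runs out.

    All component interfaces here are assumed for illustration:
      generator(descriptors) -> video
      evaluator(video)       -> {"consistent": bool, "flagged_objects": [...]}
      refiner(descriptors, flagged) -> refined descriptors
    """
    video = generator(descriptors)
    for _ in range(max_rounds):
        verdict = evaluator(video)  # automatic spatiotemporal consistency check
        if verdict["consistent"]:
            break
        # object-level refinement: fix only the flagged entities,
        # then feed the refined conditions back to the generator
        refined = refiner(descriptors, verdict["flagged_objects"])
        video = generator(refined)
    return video
```

The key design point this sketch captures is that regeneration is conditioned on object-level refinements rather than restarting from scratch, so consistent parts of the scene are preserved across rounds.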