NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

Haotian Dong, Xin Wang, Di Lin, Yipeng Wu, Qin Chen, Ruonan Liu, Kairui Yang, Ping Li, Qing Guo

TL;DR

NoiseController tackles the challenge of spatiotemporal consistency in multi-view video generation by introducing a multi-level noise decomposition that separates scene-level background/foreground noises into shared and residual components, coupled with multi-frame noise collaboration via inter-view and intra-view matrices. This global noise modeling is paired with a joint denoising stage using two parallel U-Nets to separately refine background and foreground noises, enabling more controllable, diverse, and consistent video outputs. Empirical results on nuScenes show state-of-the-art performance in FVD and FID, with further gains when integrated into existing diffusion-based systems and improvements in downstream BEV perception tasks. The approach demonstrates the benefit of combining global noise collaboration with targeted denoising to enhance multi-view video quality and applicability to perception pipelines.

Abstract

High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistency remains challenging. Current methods typically use attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during video generation. In this paper, we propose NoiseController, consisting of Multi-Level Noise Decomposition, Multi-Frame Noise Collaboration, and Joint Denoising, to enhance spatiotemporal consistency in video generation. In multi-level noise decomposition, we first decompose initial noises into scene-level foreground/background noises, capturing distinct motion properties to model multi-view foreground/background variations. Each scene-level noise is then decomposed into individual-level shared and residual components. The shared component preserves consistency, while the residual component maintains diversity. In multi-frame noise collaboration, we introduce an inter-view spatiotemporal collaboration matrix and an intra-view impact collaboration matrix, which capture mutual cross-view effects and historical cross-frame impacts to enhance video quality. Joint denoising uses two parallel denoising U-Nets that remove the two scene-level noises, mutually enhancing video generation. We evaluate NoiseController on public datasets focusing on video generation and downstream tasks, demonstrating its state-of-the-art performance.
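
To make the two decomposition levels concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: the mixing coefficient `alpha`, the square-root scaling, the single shared sample broadcast across views, and the rectangular foreground mask are all hypothetical choices made so that the mixed noise stays close to unit variance.

```python
import numpy as np

def mix_shared_residual(shared, residual, alpha):
    """Combine a shared and a residual Gaussian sample.

    Assumed scaling: sqrt(alpha) and sqrt(1 - alpha), so that two independent
    unit-variance inputs give a unit-variance output.
    """
    return np.sqrt(alpha) * shared + np.sqrt(1.0 - alpha) * residual

def split_by_mask(noise, fg_mask):
    """Split noise into masked background and masked foreground parts."""
    return (1.0 - fg_mask) * noise, fg_mask * noise

rng = np.random.default_rng(0)
views, channels, height, width = 6, 4, 28, 50   # hypothetical latent shape
alpha = 0.6                                      # hypothetical coefficient

# One shared sample broadcast to all views, one residual sample per view.
shared = rng.standard_normal((1, channels, height, width))
residual = rng.standard_normal((views, channels, height, width))
eps = mix_shared_residual(shared, residual, alpha)

# Hypothetical binary foreground mask (1 inside a box, 0 elsewhere).
fg_mask = np.zeros((views, 1, height, width))
fg_mask[:, :, 8:20, 15:35] = 1.0
noise_bg, noise_fg = split_by_mask(eps, fg_mask)
```

Under these assumptions, a larger `alpha` makes the views more correlated (more consistency), while a smaller `alpha` gives the per-view residual more weight (more diversity). The two masked parts sum back to the original noise.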

Paper Structure

This paper contains 16 sections, 10 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of NoiseController. In (a) multi-level noise decomposition, a decomposer splits the initial noises into scene-level background ($\mathbf{B}$) and foreground ($\mathbf{F}$) noises and, by scaling with different coefficients, into individual-level noises, so that each noise consists of individual-level shared ($\mathrm{S}$) and residual ($\mathrm{R}$) components. In (b) multi-frame noise collaboration, the inter-view and intra-view collaboration matrices are applied to yield the shared components of the scene-level noises for the following frames. The collaborated noise, composed of the shared components and randomly sampled residual noises, is concatenated with the outputs of (c) condition control and fed into (d) joint denoising to predict the $6$-view noises at the $t$-th denoising step.
  • Figure 2: Detailed architecture of multi-level noise decomposition. We decompose the initial noise $\epsilon_{m,n}$ into scene-level masked background noise $\mathbf{N}_{m,n}^{\mathbf{B}}$ and foreground noise $\mathbf{N}_{m,n}^{\mathbf{F}}$, following the distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. We further decompose the scene-level masked noises $\mathbf{N}_{m,n}^{\mathbb{D}}$ into individual-level masked shared components $\mathbf{N}_{m,n}^{\mathbb{D}_\mathrm{S}}$ and residual components $\mathbf{N}_{m,n}^{\mathbb{D}_\mathrm{R}}$.
  • Figure 3: Detailed architecture of multi-frame noise collaboration. We use the $6$-view noises of the preceding $K$ frames to compute the shared components $\epsilon_{m, n+1}^{\mathbb{D}_{\mathrm{S}}}$. The $6$-view noises are multiplied by the inter-view spatiotemporal collaboration matrix, and the concatenated products are then multiplied by the intra-view impact collaboration matrix. We sum the collaborations over the preceding $K$ frames to obtain the shared components of the scene-level noises $\epsilon_{m,n+1}^{\mathbb{D}_{\mathrm{S}}}$, which are then combined with sampled residual components $\epsilon_{m,n+1}^{\mathbb{D}_{\mathrm{R}}}$, yielding the $6$-view noises at the $(n+1)$-th frame.
  • Figure 4: Illustration of noise masking. We use scene-level $6$-view masks to mask the background/foreground noises.
  • Figure 5: Detailed architecture of the joint denoising network. The $6$-view scene-level background and foreground noises $\epsilon_{t}^{\mathbb{D}}$ are masked by $\mathbf{M}^{\mathbb{D}}$, which is mapped from the 3D object boxes. The masked noises $\mathbf{N}_{t}^{\mathbb{D}}$ are added to the latent feature maps $x_0$ based on the spatial layout, following Stable Diffusion (SD) [rombach2022high]. Taking the noisy latent feature maps $z_{t}$ as input, two parallel denoising U-Nets predict the scene-level noises $\tilde{\epsilon}_{t}^{\mathbb{D}}$. We then mask $\tilde{\epsilon}_{t}^{\mathbb{D}}$ by $\mathbf{M}_n^{\mathbb{D}}$ to obtain the predicted masked noises $\hat{\mathbf{N}}_{t}^{\mathbb{D}}$, which are combined to yield the final predicted noises $\epsilon_{t}'$.
  • ...and 5 more figures
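
The collaboration step described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the ring-shaped inter-view matrix, the per-frame scalar weights standing in for the intra-view impact matrix, the unit-norm normalization, and the coefficient `alpha` are all assumptions introduced here so that the result stays close to unit variance.

```python
import numpy as np

def unit_rows(m):
    """Scale each row (or a 1-D vector) to unit L2 norm."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def collaborate(prev_noises, inter_view, intra_view):
    """Shared component for the next frame from K preceding frames.

    prev_noises: (K, V, D) noises of V views over K preceding frames
    inter_view:  (V, V)    assumed cross-view mixing weights
    intra_view:  (K,)      assumed per-frame historical impact weights
    """
    # Mix the views within each preceding frame.
    mixed = np.einsum("uv,kvd->kud", inter_view, prev_noises)
    # Weight each preceding frame and sum over the K frames.
    return np.einsum("k,kud->ud", intra_view, mixed)

rng = np.random.default_rng(0)
K, V, D = 3, 6, 4 * 28 * 50   # hypothetical sizes; D is a flattened latent
alpha = 0.6                    # hypothetical shared/residual coefficient

# Assumed ring layout: each view is influenced by itself and its two neighbours.
eye = np.eye(V)
ring = eye + 0.5 * np.roll(eye, 1, axis=1) + 0.5 * np.roll(eye, -1, axis=1)
inter_view = unit_rows(ring)
# Assumed decay: more recent frames have more impact.
intra_view = unit_rows(np.array([0.25, 0.5, 1.0]))

prev_noises = rng.standard_normal((K, V, D))
shared = collaborate(prev_noises, inter_view, intra_view)

# Combine with a freshly sampled residual to get the next frame's 6-view noise.
residual = rng.standard_normal((V, D))
next_noise = np.sqrt(alpha) * shared + np.sqrt(1.0 - alpha) * residual
```

Normalizing the weights to unit L2 norm is one simple way to keep a weighted sum of independent standard Gaussians at unit variance; the paper may construct its matrices differently.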