
Compositional Video Generation as Flow Equalization

Xingyi Yang, Xinchao Wang

TL;DR

This work tackles the challenge of compositional fidelity in diffusion-based text-to-video generation by introducing Vico, a framework that equalizes the influence of input tokens. It centers on Spatial-Temporal Attention Flow (ST-Flow), which attributes the video output to text tokens across space and time, and it replaces the intractable exact max-flow computation with a differentiable subgraph-based approximation built on min-max path flows. A test-time optimization procedure updates the noisy latent to balance these token flows, so that complex prompts are represented more faithfully. Across multiple base video models, Vico improves compositional richness and semantic accuracy, with the soft min-max variant often giving the best trade-off between fidelity and optimization stability. This suggests the approach applies broadly to compositional control in video diffusion models.

Abstract

Large-scale Text-to-Video (T2V) diffusion models have recently demonstrated unprecedented capability to transform natural language descriptions into stunning and photorealistic videos. Despite the promising results, a significant challenge remains: these models struggle to fully grasp complex compositional interactions between multiple concepts and actions. This issue arises when some words dominantly influence the final video, overshadowing other concepts. To tackle this problem, we introduce Vico, a generic framework for compositional video generation that explicitly ensures all concepts are represented properly. At its core, Vico analyzes how input tokens influence the generated video, and adjusts the model to prevent any single concept from dominating. Specifically, Vico extracts attention weights from all layers to build a spatial-temporal attention graph, and then estimates the influence as the max-flow from the source text token to the video target token. Although the direct computation of attention flow in diffusion models is typically infeasible, we devise an efficient approximation based on subgraph flows and employ a fast and vectorized implementation, which in turn makes the flow computation manageable and differentiable. By updating the noisy latent to balance these flows, Vico captures complex interactions and consequently produces videos that closely adhere to textual descriptions. We apply our method to multiple diffusion-based video models for compositional T2V and video editing. Empirical results demonstrate that our framework significantly enhances the compositional richness and accuracy of the generated videos. Visit our website at https://adamdad.github.io/vico/.
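
As a rough illustration of the subgraph-flow idea in the abstract, the sketch below computes a hard max-min (widest-path) capacity from each text token to each video node through a stack of attention matrices. The function names, the toy capacities, and the use of a hard rather than soft min-max are assumptions made for illustration; this is not the authors' vectorized implementation.

```python
def maxmin_product(a, b):
    """(max, min) semiring product: out[i][j] = max over k of min(a[i][k], b[k][j])."""
    return [
        [max(min(a[i][k], b[k][j]) for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def bottleneck_flow(layers):
    """Widest-path capacity from every source node to every sink node.

    `layers` is a list of capacity matrices (hypothetical attention weights).
    layers[0] maps text tokens to the first layer of nodes; each later matrix
    maps one layer of nodes to the next. The flow along a single path is its
    smallest edge capacity, and each (source, sink) pair keeps its best path.
    """
    flow = layers[0]
    for capacity in layers[1:]:
        flow = maxmin_product(flow, capacity)
    return flow


# Toy example: 2 text tokens -> 2 intermediate nodes -> 2 video nodes.
layers = [
    [[0.9, 0.1], [0.2, 0.3]],
    [[0.8, 0.5], [0.4, 0.7]],
]
print(bottleneck_flow(layers))  # [[0.8, 0.5], [0.3, 0.3]]
```

In this toy graph the first token reaches both video nodes with more capacity than the second, which is the kind of imbalance that flow equalization is meant to reduce.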

Paper Structure (21 sections, 7 equations, 6 figures, 7 tables, 1 algorithm)

Figures (6)

  • Figure 1: Examples of compositional video generation with Vico on top of VideoCrafterv2 (chen2024videocrafter2). We identify four typical failure types in compositional T2V: (Row 1) Missing Subject, (Row 2) Spatial Confusion, (Row 3) Semantic Leakage, and (Row 4) Motion Mixing. Vico provides a unified solution to these issues by equalizing the contributions of text tokens.
  • Figure 2: Overall pipeline of Vico. Before each denoising step, Vico extracts attention maps from each layer to build a spatiotemporal graph. We calculate the attribution scores as max-flow in the graph and adjust the noisy latent code to balance these flows.
  • Figure 3: Attribution heatmap comparison between DAAM and ST-Flow.
  • Figure 4: Qualitative comparison of the videos generated by the VideoCrafterv2 baseline, Attend&Excite, and our Vico with compositional textual descriptions.
  • Figure 5: Video edit results with compositional prompts.
  • ...and 1 more figure
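
The latent-update step described in the Figure 2 caption can be sketched on a toy problem. Everything here is a hypothetical stand-in: `token_flows` replaces the real attention-graph flow with a softmax over dot products, the loss simply raises the weakest token's flow, and gradients come from finite differences rather than backpropagation through a diffusion model.

```python
import math


def token_flows(latent, token_embeddings):
    """Toy per-token influence: softmax over dot(latent, embedding)."""
    scores = [sum(z * e for z, e in zip(latent, emb)) for emb in token_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def imbalance(latent, token_embeddings):
    """Loss that is lowest when the weakest token's flow is as large as possible."""
    return -min(token_flows(latent, token_embeddings))


def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x (x is a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def equalize(latent, token_embeddings, steps=200, lr=0.5):
    """Gradient descent on the latent so that no token dominates."""
    z = list(latent)
    for _ in range(steps):
        g = numeric_grad(lambda v: imbalance(v, token_embeddings), z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


embeddings = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical text tokens
start = [2.0, 0.0]                     # latent initially dominated by token 0
before = token_flows(start, embeddings)
after = token_flows(equalize(start, embeddings), embeddings)
print(before, after)
```

In this toy setting the first token starts with most of the influence, and the update moves the latent until the two flows are roughly equal.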