BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Weixi Feng, Chao Liu, Sifei Liu, William Yang Wang, Arash Vahdat, Weili Nie

TL;DR

This paper addresses the challenge of controllability and compositionality in text-to-video generation by introducing blob video representations as a grounding primitive. BlobGEN-Vid is a model-agnostic diffusion framework that attaches per-object blob parameters and captions to video frames through masked 3D self-attention and masked spatial cross-attention, plus a context interpolation module and an LLM-based blob-layout generator. The authors evaluate on layout-grounded and text-to-video benchmarks, showing improved layout controllability (mIoU) and prompt alignment (CLIP) and demonstrating strong compositional performance, even surpassing some proprietary systems when combined with GPT-4o for blob planning. The work provides a scalable, modular approach to grounded video synthesis with practical benefits for zero-shot generation and multi-view consistency.

Abstract

Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.

Paper Structure (38 sections, 5 equations, 20 figures, 6 tables)

Figures (20)

  • Figure 1: With blob video representations, BlobGEN-Vid supports fine-grained controllability in text-to-video generation in terms of motion control, camera control, numerical accuracy and attribute transition. Blobs in the top two rows are extracted from a video and a 3D scene, respectively, using a pre-trained segmentation model and an image captioning model, while blobs in the bottom two rows are generated by GPT-4o with the given global prompt as input.
  • Figure 2: Blob video representations for video generation consist of blob parameters and blob descriptions. Blob parameters exist for every frame, while blob descriptions are provided only every $k$ frames. Therefore, only frames $1, k+1, 2k+1, \dots$ have blob descriptions.
  • Figure 3: BlobGEN-Vid architecture with a U-Net backbone or a DiT backbone. Our method leverages two masked attention modules that allow: 1) visual features to attend only to the corresponding blob embeddings; 2) the same object to attend to itself across frames. High-value elements in the 3D attention mask in the figure are mapped to 0, while low-value elements are mapped to $-\infty$, as in Eq. \ref{eq:hwt_mask}. Note that the multiple colors in the binary 3D attention mask come from aliasing during visualization.
  • Figure 4: Layout-to-video generation results on YoutubeVIS-2021 \cite{vis2021}. The visualized layouts are ground truth layouts fed into the models during inference. Our method shows better prompt-video alignment than the strongest baseline TrackDiffusion \cite{li2023trackdiffusion}.
  • Figure 5: Qualitative results on ScanNet++, where our method, especially with masked 3D attention, shows much better consistency in the door appearance than BlobGEN-3D.
  • ...and 15 more figures
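
The captions of Figures 2 and 3 describe two mechanical details that a short sketch may make easier to follow: which frames carry blob descriptions (frames $1, k+1, \dots$), and how a binary attention mask is turned into an additive one (kept entries become 0, blocked entries become $-\infty$). The Python below is only an illustration under stated assumptions, not the authors' implementation: the function names `described_frames`, `additive_mask` and `masked_softmax` are hypothetical, the code uses plain lists rather than tensors, and the mask equation the caption refers to is not reproduced on this page.

```python
import math


def described_frames(num_frames: int, k: int) -> list[int]:
    """Return the 1-based indices of frames that carry blob descriptions.

    Following the Figure 2 caption, descriptions appear at frames 1, k+1, 2k+1, ...
    (hypothetical helper, not from the paper's code).
    """
    if num_frames < 0 or k < 1:
        raise ValueError("num_frames must be >= 0 and k must be >= 1")
    return list(range(1, num_frames + 1, k))


def additive_mask(binary_mask: list[list[int]]) -> list[list[float]]:
    """Map kept entries (1) to 0.0 and blocked entries (0) to -inf.

    The result can be added to attention logits before the softmax,
    as described in the Figure 3 caption.
    """
    return [[0.0 if keep else -math.inf for keep in row] for row in binary_mask]


def masked_softmax(logits: list[float], mask_row: list[float]) -> list[float]:
    """Softmax over one query's logits after adding one additive mask row."""
    scores = [logit + m for logit, m in zip(logits, mask_row)]
    finite = [s for s in scores if s != -math.inf]
    if not finite:
        raise ValueError("mask row blocks every position")
    peak = max(finite)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # With 10 frames and k = 4, descriptions sit on frames 1, 5 and 9.
    print(described_frames(10, 4))
    # The middle position is blocked, so it receives zero attention weight.
    row = additive_mask([[1, 0, 1]])[0]
    print(masked_softmax([1.0, 2.0, 1.0], row))
```

Adding $-\infty$ before the softmax, rather than zeroing weights afterwards, keeps the remaining weights normalized to sum to 1, which is presumably why the caption describes the mask in additive form.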