BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan

TL;DR

BindWeave tackles subject-consistent video generation by using a multimodal large language model (MLLM) to ground complex prompts and produce subject-aware hidden states that condition a Diffusion Transformer. It combines two conditioning streams, MLLM-derived relational features and CLIP-based identity cues, with VAE-based fine-detail conditioning to preserve subject fidelity and dynamic interactions across single- and multi-subject scenes. Trained on OpenS2V-5M with a two-stage curriculum and sampled with 50-step rectified-flow inference, it achieves state-of-the-art performance on OpenS2V-Eval, notably improving NexusScore and text–video relevance. The approach improves controllability for personalized content, branding, and virtual production workflows, with reproducible results and open-source potential.
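
As a rough illustration of what 50-step rectified-flow inference means in general, the sketch below integrates a velocity field with 50 Euler steps. It is not BindWeave's sampler: the function names, the time direction (t = 0 for noise, t = 1 for the sample), and the closed-form toy velocity that stands in for a conditioned Diffusion Transformer are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def euler_rectified_flow(
    x_noise: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int = 50,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)  # in a real model, a network call with conditioning
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical oracle: velocity of the straight path that ends at `target`."""
    def velocity(x: Vector, t: float) -> Vector:
        # On a straight path the remaining displacement is covered in (1 - t).
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


target = [1.0, -2.0, 0.5]
sample = euler_rectified_flow([0.3, 0.1, -0.7], make_toy_velocity(target))
```

With this toy straight-line velocity, the Euler steps end at `target` up to floating-point error. A learned velocity field is only approximately straight, so the step count trades compute for accuracy.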

Abstract

Diffusion Transformers have shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.

Paper Structure

This paper contains 21 sections, 13 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Examples of subject-to-video generation results of our proposed BindWeave, demonstrating its ability to produce high-fidelity, subject-consistent videos across a broad spectrum of scenarios from single-subject inputs to complex multi-subject compositions.
  • Figure 2: Framework of our method. A multimodal large language model performs cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions from the prompt and optional reference images. The resulting subject-aware signals condition a Diffusion Transformer through cross-attention and lightweight adapters, guiding identity-faithful, relation-consistent, and temporally coherent video generation.
  • Figure 3: Illustration of our adaptive multi-reference conditioning strategy.
  • Figure 4: Qualitative comparison on the subject-to-video task, with four uniformly sampled frames shown in each case. Compared to other competing methods, our approach is superior in subject fidelity, naturalness, and semantic consistency with the caption.
  • Figure 5: Qualitative comparison on the subject-to-video task, with four uniformly sampled frames shown in each case. Compared with other methods, our approach better avoids implausible phenomena and produces more natural videos while maintaining strong subject consistency.
  • ...and 5 more figures