BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
Zhaoyang Li, Dongjun Qian, Kai Su, Qishuai Diao, Xiangyang Xia, Chang Liu, Wenfei Yang, Tianzhu Zhang, Zehuan Yuan
TL;DR
BindWeave tackles subject-consistent video generation by using a Multimodal Large Language Model (MLLM) to ground complex prompts, producing subject-aware hidden states that condition a Diffusion Transformer. It combines three conditioning signals: MLLM-derived relational features, CLIP-based identity cues, and VAE-based fine-detail features, which together preserve subject fidelity and dynamic interactions across single- and multi-subject scenes. Training on OpenS2V-5M with a two-stage curriculum, followed by 50-step rectified-flow inference, yields state-of-the-art performance on OpenS2V-Eval, notably improving NexusScore and text–video relevance. The approach improves controllability for personalized content, branding, and virtual production workflows, with reproducible results and open-source potential.
Abstract
Diffusion Transformers have shown remarkable ability to generate high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation because they struggle to parse prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios, from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
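The TL;DR mentions 50-step rectified-flow inference with a conditioned Diffusion Transformer. As a rough illustration only, the sketch below integrates a conditioned velocity field with 50 Euler steps. Everything in it is an assumption made for illustration: the function names, the time convention (t = 0 is noise, t = 1 is data), and the toy velocity field, which merely stands in for the actual DiT and its MLLM, CLIP, and VAE conditioning. It is not the authors' sampler.

```python
import numpy as np

def euler_rectified_flow_sample(velocity_fn, x_init, cond, num_steps=50):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps.

    velocity_fn is a hypothetical stand-in for the conditioned video model.
    """
    x = np.array(x_init, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes values 0, dt, ..., 1 - dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

def toy_velocity(x, t, cond):
    """Toy velocity field (assumption): points straight at the array `cond`.

    On a straight path x_t = (1 - t) * x_0 + t * x_1 the velocity is
    x_1 - x_0, which equals (x_1 - x_t) / (1 - t). Here `cond` plays
    the role of x_1; a real model would predict the velocity from learned
    conditioning features instead.
    """
    return (cond - x) / (1.0 - t)

# Usage: start from Gaussian noise and run 50 steps toward a toy target.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
target = np.array([1.0, -2.0, 0.5, 3.0])
sample = euler_rectified_flow_sample(toy_velocity, noise, target, num_steps=50)
```

Because the toy path is exactly straight, Euler integration reaches the target up to floating-point error. A learned velocity field is only approximately straight, which is one reason the number of steps (50 in the TL;DR) matters in practice.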
