PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi

TL;DR

PolyVivid targets controllable, identity-consistent multi-subject video generation. It combines a VLLM-based text-image fusion module, a 3D-RoPE-based identity-interaction enhancement module, and an attention-inherited identity injection mechanism, and it is trained on data from an MLLM-driven construction pipeline. By grounding subject images in the text space and enabling structured cross-modal interaction, it improves identity fidelity, text–video alignment, and video realism in complex multi-subject interactions. Ablation studies validate the contribution of each component, and the full system outperforms existing open-source and commercial baselines. The work is relevant to personalized content creation and multi-subject video production; its limitations and societal considerations are discussed in the appendix.

Abstract

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module that injects the fused identity features into the video generation process, mitigating identity drift. Finally, we construct a data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.

Paper Structure

This paper contains 21 sections, 7 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: PolyVivid generates high-quality customized videos from multiple subject images and a text prompt, maintaining high subject similarity and the subject interactions specified by the text.
  • Figure 2: Framework of PolyVivid: the text prompt and reference image are fused by the VLLM-based text-image fusion module. A 3D-RoPE-based identity-interaction enhancement module then strengthens the text-image interaction. The enhanced image tokens are injected through an MM cross-attention module, which helps preserve identities while ensuring good subject interaction.
  • Figure 3: Comparison of the condition injection strategies for MM-DiT.
  • Figure 4: Comparison on multi-subject video customization.
  • Figure 5: Examples from the test set, which contains images from diverse categories such as humans, animals, man-made machines, food, goods, and buildings.
  • ...and 5 more figures