MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement

Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma

TL;DR

The paper tackles any-reference video generation by introducing MAGREF, a unified framework that combines masked guidance with a region-aware masking scheme and pixel-wise channel concatenation to preserve multiple reference identities without architectural changes. A subject disentanglement mechanism explicitly ties text-derived semantics to corresponding visual regions, mitigating cross-subject interference. A four-stage data curation pipeline is proposed to diversify training pairs and suppress copy-paste artifacts. Extensive experiments demonstrate state-of-the-art performance across single-ID and multi-subject benchmarks, with strong identity preservation, subject coherence, and textual alignment. MAGREF thus enables scalable, controllable, and high-fidelity any-reference video synthesis, with prospects for multi-modal and long-video extensions in future work.

Abstract

We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design preserves identity consistency and maintains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism which injects the semantic values of each subject derived from the text condition into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF

Paper Structure

This paper contains 52 sections, 25 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: We present MAGREF, a flexible video generation framework that supports arbitrary combinations of subjects including humans, animals, clothing, accessories, and environments within a single generation process, while maintaining visual consistency and faithfully following textual instructions. (a) Qualitative results across diverse subjects and scenes, with reference images provided in the top-left corner. More qualitative cases are provided in the supplementary figures (labels fig:supp_mix through fig:supp_multi_subject). (b) User study comparing MAGREF with existing models. (c) Quantitative comparison for the multi-subject evaluation set.
  • Figure 2: Qualitative results showcasing diverse subjects and scenes, with reference images provided in the first two columns. MAGREF supports a wide range of pairings, including humans with accessories and fashion items. It reliably identifies the intended objects, even in complex or cluttered reference images, and faithfully follows the text prompt.
  • Figure 3: (a) Overview of MAGREF. We introduce a region-aware masking mechanism to encode multiple references and concatenate them with noise latents, together with a subject disentanglement mechanism that links each reference to its textual label to avoid cross-subject entanglement. Compared with the (b) vanilla masking mechanism, which concatenates references along the frame dimension, our (c) region-aware masking mechanism merges references into a composite image, encodes it with a VAE, and applies a downsampled binary mask to indicate subject regions, thereby better preserving first-frame consistency in I2V models.
  • Figure 4: Cosine similarity visualization between composite reference input image and textual labels. MAGREF achieves more accurate alignment of the Man and the Woman in the multi-subject composite image with the corresponding text prompts. In contrast, removing Subject Disentanglement (SD) results in entangled and ambiguous associations.
  • Figure 5: A systematic four-stage data pipeline for collecting high-quality training samples.
  • ...and 9 more figures
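
The region-aware masking mechanism described in the Figure 3 caption (composite reference image, VAE encoding, downsampled binary mask, channel-wise concatenation with noise latents) can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the side-by-side layout, the function names, the max-pool mask downsampling, the single latent frame, and the `toy_encode` stand-in for the VAE are all hypothetical choices made for illustration.

```python
import numpy as np


def build_composite(refs, canvas_hw):
    """Paste reference images side by side on a blank canvas (assumed layout).

    Returns the composite image and a binary mask marking subject pixels.
    """
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.float32)
    mask = np.zeros((H, W), dtype=np.float32)
    slot_w = W // len(refs)  # one horizontal slot per reference
    for i, ref in enumerate(refs):
        h, w = ref.shape[:2]
        assert h <= H and w <= slot_w, "reference does not fit its slot"
        x0 = i * slot_w
        canvas[:h, x0:x0 + w] = ref
        mask[:h, x0:x0 + w] = 1.0
    return canvas, mask


def downsample_mask(mask, factor):
    """Max-pool the mask so any latent cell touching a subject is set to 1."""
    H, W = mask.shape
    assert H % factor == 0 and W % factor == 0
    blocks = mask.reshape(H // factor, factor, W // factor, factor)
    return blocks.max(axis=(1, 3))


def toy_encode(image, factor, latent_channels=4):
    """Hypothetical stand-in for a VAE encoder: average-pool, then project."""
    H, W, C = image.shape
    blocks = image.reshape(H // factor, factor, W // factor, factor, C)
    pooled = blocks.mean(axis=(1, 3))
    rng = np.random.default_rng(0)  # fixed projection, for reproducibility
    proj = rng.standard_normal((C, latent_channels)).astype(np.float32)
    return pooled @ proj  # shape (H/factor, W/factor, latent_channels)


def assemble_input(noise_latent, ref_latent, latent_mask):
    """Concatenate noise, reference latent, and mask along the channel axis."""
    return np.concatenate(
        [noise_latent, ref_latent, latent_mask[..., None]], axis=-1
    )


# Two 16x16 references on a 32x64 canvas, with 8x spatial downsampling.
refs = [np.ones((16, 16, 3), dtype=np.float32) for _ in range(2)]
canvas, mask = build_composite(refs, (32, 64))
latent_mask = downsample_mask(mask, 8)
ref_latent = toy_encode(canvas, 8)
noise = np.random.default_rng(1).standard_normal((4, 8, 4)).astype(np.float32)
model_input = assemble_input(noise, ref_latent, latent_mask)
print(model_input.shape)  # (4, 8, 9): 4 noise + 4 reference + 1 mask channels
```

Because the references enter only as extra input channels, the spatial shape of the latent is unchanged, which is consistent with the abstract's claim that no architectural changes to the backbone are required.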