MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
TL;DR
The paper tackles any-reference video generation by introducing MAGREF, a unified framework built on masked guidance, which pairs a region-aware masking scheme with pixel-wise channel concatenation to preserve multiple reference identities without architectural changes. A subject disentanglement mechanism explicitly ties the text-derived semantics of each subject to its corresponding visual region, mitigating cross-subject interference. A four-stage data curation pipeline diversifies training pairs and suppresses copy-paste artifacts. Extensive experiments demonstrate state-of-the-art performance across single-ID and multi-subject benchmarks, with strong identity preservation, subject coherence, and textual alignment. MAGREF thus enables scalable, controllable, and high-fidelity any-reference video synthesis, with multi-modal and long-video extensions left as future work.
Abstract
We tackle the task of any-reference video generation, which aims to synthesize videos conditioned on arbitrary types and combinations of reference subjects, together with textual prompts. This task faces persistent challenges, including identity inconsistency, entanglement among multiple reference subjects, and copy-paste artifacts. To address these issues, we introduce MAGREF, a unified and effective framework for any-reference video generation. Our approach incorporates masked guidance and a subject disentanglement mechanism, enabling flexible synthesis conditioned on diverse reference images and textual prompts. Specifically, masked guidance employs a region-aware masking mechanism combined with pixel-wise channel concatenation to preserve appearance features of multiple subjects along the channel dimension. This design maintains identity consistency and retains the capabilities of the pre-trained backbone, without requiring any architectural changes. To mitigate subject confusion, we introduce a subject disentanglement mechanism that injects the semantic values of each subject, derived from the text condition, into its corresponding visual region. Additionally, we establish a four-stage data pipeline to construct diverse training pairs, effectively alleviating copy-paste artifacts. Extensive experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches, paving the way for scalable, controllable, and high-fidelity any-reference video synthesis. Code and models are available at: https://github.com/MAGREF-Video/MAGREF
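
The masked-guidance idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `build_masked_condition`, the single-frame `(C, H, W)` layout, the blank-canvas placement, and the one-channel binary mask are all assumptions made for illustration. It only shows how several references might be placed in separate regions, marked by a region-aware mask, and concatenated with a noisy latent along the channel dimension.

```python
import numpy as np

def build_masked_condition(noisy_latent, references, boxes):
    """Hypothetical sketch of region-aware masking plus channel concatenation.

    noisy_latent: array of shape (C, H, W).
    references:   list of arrays of shape (C, h_i, w_i), one per subject.
    boxes:        list of (top, left) offsets giving each reference's region.
    Returns an array of shape (2 * C + 1, H, W).
    """
    c, h, w = noisy_latent.shape
    # Blank canvas that will hold every reference in its own region (assumption).
    canvas = np.zeros((c, h, w), dtype=noisy_latent.dtype)
    # Binary mask: 1 where a reference is present, 0 elsewhere (assumption).
    mask = np.zeros((1, h, w), dtype=noisy_latent.dtype)

    for ref, (top, left) in zip(references, boxes):
        _, rh, rw = ref.shape
        if top < 0 or left < 0 or top + rh > h or left + rw > w:
            raise ValueError("reference region falls outside the canvas")
        canvas[:, top:top + rh, left:left + rw] = ref
        mask[:, top:top + rh, left:left + rw] = 1.0

    # Pixel-wise concatenation along the channel axis: the spatial layout is
    # unchanged, so only the number of input channels grows.
    return np.concatenate([noisy_latent, canvas, mask], axis=0)
```

Because the conditioning enters only as extra input channels in this sketch, the rest of a backbone would stay untouched, which is one plausible reading of "without requiring any architectural changes".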
