VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia

TL;DR

VFXMaster reframes visual effects (VFX) video generation as in-context learning from a reference video, enabling a single unified model to imitate diverse effects and generalize to unseen categories. It introduces an in-context conditioning framework with a dedicated attention mask to prevent content leakage, and an efficient one-shot adaptation step that uses concept-enhancing tokens to boost out-of-domain (OOD) performance. Trained on a large, diverse VFX dataset and evaluated with both conventional metrics (FVD, EOS, EFS, CLS) and a VLM-based VFX-Cons score, the method demonstrates strong in-domain fidelity and superior generalization to OOD effects, outperforming existing approaches. The work also provides detailed training and inference protocols, ablations, and a plan to release code, models, and data to support future research, with the aim of making scalable, generalizable VFX generation practical for creators.

Abstract

Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, so that the model can reproduce diverse dynamic effects from a reference video on target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask precisely decouples and injects the essential effect attributes, allowing a single unified model to master effect imitation without information leakage. We also propose an efficient one-shot effect adaptation mechanism that rapidly boosts generalization on difficult unseen effects from a single user-provided video. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

Paper Structure

This paper contains 29 sections, 2 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Overview of VFXMaster. 1) During training, we randomly sample two prompt-video pairs that share the same visual effect, one as the reference and one as the target. Because both are encoded by the same 3D VAE and text encoder, the reference part and the target part are mapped into the same latent space. We concatenate them along the token dimension into a unified token sequence and feed it into the DiT blocks. 2) We design an attention mask that manages information flow so that the model focuses on the visual effect of the reference and information leakage is prevented. 3) For difficult out-of-domain (OOD) samples, we propose an efficient one-shot effect adaptation process that trains concept-enhancing tokens to improve generalization.
  • Figure 2: In-Domain Comparison. Qualitative comparison of our method with VFXCreator [liu2025vfx] and OminiEffects [mao2025omni] on the OpenVFX dataset. CogVideoX* refers to CogVideoX after supervised fine-tuning on our VFX dataset. All human portraits used in the experiments are AI-generated, and this applies to all subsequent images.
  • Figure 3: Out-of-Domain Comparison.
  • Figure 4: Examples of the "Invisible" and "Soul Jump" visual effects using VFXMaster.
  • Figure 5: Examples of the "Freezing" and "Blazing" visual effects using VFXMaster.
  • ...and 10 more figures
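
The Figure 1 caption describes concatenating reference and target tokens into one sequence and masking attention between them. The sketch below illustrates that idea in plain Python. The segment order, the names `SEGMENTS`, `ALLOWED`, `segment_ids`, and `build_mask`, and above all the visibility rule itself are assumptions made for illustration; the source text does not specify which token groups may attend to which, so this should not be read as the paper's actual mask.

```python
from typing import Dict, List, Tuple

# Assumed order of segments in the unified token sequence (illustrative only).
SEGMENTS: Tuple[str, ...] = ("ref_text", "ref_video", "tgt_text", "tgt_video")

# Hypothetical visibility rule: ALLOWED[q] lists the segments that a query
# token in segment q may attend to. Here the reference side never sees the
# target, and only target video tokens read from the reference video.
ALLOWED: Dict[str, Tuple[str, ...]] = {
    "ref_text": ("ref_text", "ref_video"),
    "ref_video": ("ref_text", "ref_video"),
    "tgt_text": ("tgt_text", "tgt_video"),
    "tgt_video": ("tgt_text", "tgt_video", "ref_video"),
}


def segment_ids(lengths: Dict[str, int]) -> List[str]:
    """Label every position of the concatenated sequence with its segment."""
    ids: List[str] = []
    for name in SEGMENTS:
        ids.extend([name] * lengths[name])
    return ids


def build_mask(lengths: Dict[str, int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    ids = segment_ids(lengths)
    return [[k in ALLOWED[q] for k in ids] for q in ids]


if __name__ == "__main__":
    # Tiny token counts so the block pattern is easy to read.
    lengths = {"ref_text": 2, "ref_video": 3, "tgt_text": 2, "tgt_video": 3}
    for row in build_mask(lengths):
        print("".join("#" if visible else "." for visible in row))
```

Running the script prints one row per query token, with `#` marking visible keys, which makes the block structure of the mask easy to inspect. In a real DiT, a boolean matrix of this shape would typically be passed to the attention operator as its mask argument; changing `ALLOWED` is enough to try other visibility rules.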