TransAnimate: Taming Layer Diffusion to Generate RGBA Video

Xuewei Chen, Zhimin Chen, Yiren Song

TL;DR

TransAnimate addresses the data scarcity and model-integration challenges of RGBA video generation by unifying transparency modeling with motion-aware diffusion and leveraging pre-trained RGB and transparent backbones. It builds a three-dataset pipeline (high-quality game effects, foreground objects, and synthetic transparent motions) and introduces positive-trigger data augmentation to improve edge fidelity and motion coherence. The method inflates 2D image models to operate on 5D video tensors, incorporates a motion module, and adapts RGB-based controllable signals (via SparseCtrl adapters) for pixel-precise RGBA control, enabling robust game-effect generation. Experiments report strong qualitative and quantitative performance, with effective controllability and data-efficient learning, making RGBA video creation more practical for gaming and visual effects pipelines.
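The "2D to 5D" inflation mentioned above is commonly realized by reshaping, so that pre-trained 2D layers see each frame as a separate batch item while temporal layers see each pixel location as a sequence over frames. The following is a minimal NumPy sketch of that reshaping pattern only. The `(batch, channels, frames, height, width)` layout, the function names, and the example sizes are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np

def to_spatial_batch(video):
    """(B, C, F, H, W) -> (B*F, C, H, W): frames become batch items for 2D layers."""
    b, c, f, h, w = video.shape
    return video.transpose(0, 2, 1, 3, 4).reshape(b * f, c, h, w)

def to_temporal_tokens(video):
    """(B, C, F, H, W) -> (B*H*W, F, C): each pixel location becomes a sequence over frames."""
    b, c, f, h, w = video.shape
    return video.transpose(0, 3, 4, 2, 1).reshape(b * h * w, f, c)

def from_spatial_batch(frames, batch_size):
    """Inverse of to_spatial_batch: (B*F, C, H, W) -> (B, C, F, H, W)."""
    bf, c, h, w = frames.shape
    f = bf // batch_size
    return frames.reshape(batch_size, f, c, h, w).transpose(0, 2, 1, 3, 4)

# Hypothetical sizes: 2 clips, 4 latent channels, 8 frames, 16x16 latents.
video = np.arange(2 * 4 * 8 * 16 * 16, dtype=np.float32).reshape(2, 4, 8, 16, 16)
spatial = to_spatial_batch(video)      # input layout for frozen 2D layers
temporal = to_temporal_tokens(video)   # input layout for temporal attention
restored = from_spatial_batch(spatial, batch_size=2)
```

Because both views are pure rearrangements of the same tensor, 2D weights can be reused unchanged and only the temporal layers need to be trained on video data.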

Abstract

Text-to-video generative models have made remarkable advancements in recent years. However, generating RGBA videos with alpha channels for transparency and visual effects remains a significant challenge due to the scarcity of suitable datasets and the complexity of adapting existing models for this purpose. To address these limitations, we present TransAnimate, an innovative framework that integrates RGBA image generation techniques with video generation modules, enabling the creation of dynamic and transparent videos. TransAnimate efficiently leverages pre-trained text-to-transparent image model weights and combines them with temporal models and controllability plugins trained on RGB videos, adapting them for controllable RGBA video generation tasks. Additionally, we introduce an interactive motion-guided control mechanism, where directional arrows define movement and colors adjust scaling, offering precise and intuitive control for designing game effects. To further alleviate data scarcity, we have developed a pipeline for creating an RGBA video dataset, incorporating high-quality game effect videos, extracted foreground objects, and synthetic transparent videos. Comprehensive experiments demonstrate that TransAnimate generates high-quality RGBA videos, establishing it as a practical and effective tool for applications in gaming and visual effects.
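To make concrete why the alpha channel matters for visual effects, the sketch below composites one RGBA frame over a background with the standard "over" operator. It assumes straight (non-premultiplied) alpha and float values in [0, 1]; the function name and the toy frame are illustrative and are not taken from the paper.

```python
import numpy as np

def composite_over(rgba, background):
    """Blend an RGBA frame over an RGB background.

    rgba: (H, W, 4) floats in [0, 1], straight (non-premultiplied) alpha.
    background: (H, W, 3) floats in [0, 1].
    """
    rgb = rgba[..., :3]
    alpha = rgba[..., 3:4]  # keep the last axis so it broadcasts over RGB
    return alpha * rgb + (1.0 - alpha) * background

# Toy 2x2 frame: one opaque red pixel, one half-transparent red pixel,
# and two fully transparent pixels.
frame = np.zeros((2, 2, 4), dtype=np.float32)
frame[0, 0] = [1.0, 0.0, 0.0, 1.0]
frame[0, 1] = [1.0, 0.0, 0.0, 0.5]
background = np.ones((2, 2, 3), dtype=np.float32)  # white background
out = composite_over(frame, background)
```

Applying the same function to every frame of a generated clip places the effect over arbitrary game footage, which an RGB-only video cannot do without a separate matting step.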

Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: RGBA Video Generation with TransAnimate. By utilizing pre-trained text-to-transparent image models, the motion-guided control mechanism, and the proposed dataset, TransAnimate enables high-quality generation and effective control of video content.
  • Figure 2: Framework Overview. TransAnimate generates transparent videos by learning motion patterns from videos. A frozen Transparent Encoder extracts features, refined by Temporal Attention and Linear Layers. Pre-trained SparseCtrl weights enable control via motion, sketches, and RGB images. A frozen Transparent Decoder reconstructs transparent frames, enhancing generation with limited RGBA data.
  • Figure 3: Illustration of TransAnimate. Our dataset consists of (a) the Animate Dataset with 3,000 high-quality game effect videos, (b) the Foreground Object Videos Dataset with 7,000 segmented videos capturing diverse motion patterns, and (c) Synthesized Transparent Motion Videos with 20,000 generated samples featuring controlled motion transformations. For the synthesized dataset, the rows from top to bottom show Motion Caption, Raw Image, Synthetic, and Motion Control.
  • Figure 4: Text-to-RGBA video generation results of TransAnimate.
  • Figure 5: Conditional generation results from TransAnimate. The qualitative results are generated with sketch, depth, and RGB image conditions. The input conditions are displayed on the left, while the keyframes guided by these conditions are highlighted with blue borders.
  • ...and 3 more figures
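
The synthesized transparent motion videos in Figure 3(c) are described as transparent images with controlled motion transformations applied. As a rough illustration of that idea, the sketch below moves an RGBA sprite across a transparent canvas by a fixed step per frame. It covers translation only, and the function name, parameters, and sizes are hypothetical; the paper's actual synthesis pipeline may differ.

```python
import numpy as np

def synthesize_motion_clip(sprite, canvas_size, start, step, num_frames):
    """Paste an RGBA sprite on a transparent canvas, shifting it each frame.

    sprite: (h, w, 4) array; canvas_size: (H, W);
    start, step: (row, col) integer offsets.
    Returns an array of shape (num_frames, H, W, 4).
    """
    H, W = canvas_size
    h, w = sprite.shape[:2]
    clip = np.zeros((num_frames, H, W, 4), dtype=sprite.dtype)
    for t in range(num_frames):
        top = start[0] + t * step[0]
        left = start[1] + t * step[1]
        # Clip the sprite rectangle to the canvas bounds.
        r0, r1 = max(top, 0), min(top + h, H)
        c0, c1 = max(left, 0), min(left + w, W)
        if r0 < r1 and c0 < c1:
            clip[t, r0:r1, c0:c1] = sprite[r0 - top:r1 - top, c0 - left:c1 - left]
    return clip

# A 2x2 opaque white sprite moving right by 2 pixels per frame on a 6x6 canvas.
sprite = np.ones((2, 2, 4), dtype=np.float32)
clip = synthesize_motion_clip(sprite, (6, 6), start=(0, 0), step=(0, 2), num_frames=4)
```

Since the start position and step are known for each clip, the same parameters could also be recorded as a motion caption or motion-control signal alongside the frames. Scaling could be added in the same loop by resizing the sprite before pasting.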