TransText: Alpha-as-RGB Representation for Transparent Text Animation

Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel

Abstract

We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly treat transparency (the alpha channel) as an extra latent dimension appended to the RGB space, which requires retraining the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm that jointly models appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly enforcing cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
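
The Alpha-as-RGB idea described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the alpha matte is projected to RGB by simple channel replication (the abstract does not specify the projection), the hypothetical `toy_encode` function stands in for the frozen pre-trained VAE encoder by average-pooling, and the two latents are concatenated along the width axis (the abstract does not say which spatial axis is used).

```python
import numpy as np

def alpha_to_rgb(alpha):
    """Turn a (T, H, W) alpha matte in [0, 1] into a (T, H, W, 3) RGB-like video.

    Assumption: plain channel replication; the actual projection is not stated.
    """
    return np.repeat(alpha[..., None], 3, axis=-1)

def toy_encode(video, factor=8):
    """Hypothetical stand-in for a frozen RGB VAE encoder.

    Average-pools each frame by `factor` so the example runs without a model.
    """
    t, h, w, c = video.shape
    pooled = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return pooled.mean(axis=(2, 4))

def alpha_as_rgb_latents(rgb, alpha, factor=8):
    """Encode RGB and alpha with the same encoder, then concatenate spatially."""
    z_rgb = toy_encode(rgb, factor)
    z_alpha = toy_encode(alpha_to_rgb(alpha), factor)
    # Concatenate along width, not channels: the latent channel count is
    # unchanged, so the encoder itself needs no modification.
    return np.concatenate([z_rgb, z_alpha], axis=2)

rgb = np.random.rand(4, 64, 64, 3)   # T=4 frames of 64x64 RGB
alpha = np.random.rand(4, 64, 64)    # matching alpha mattes
z = alpha_as_rgb_latents(rgb, alpha)
print(z.shape)  # (4, 8, 16, 3)
```

The point of the sketch is the shape of the result: the width doubles while the channel dimension stays at the encoder's native size, in contrast to approaches that append alpha as an extra latent dimension.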

Paper Structure

This paper contains 15 sections, 6 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Visual frame-wise results of our image-to-video (I2V) RGBA glyph animation model. TransText generates diverse and complex glyph animations with accurate, clean transparency while strictly preserving the style of the reference image. See the Supplementary Materials for additional video results.
  • Figure 1: Visualization of original samples from our glyph animation dataset, with one example shown for each visual effect. The reference image (the column marked in blue) corresponds to the middle frame of each video clip. Please zoom in for a better view.
  • Figure 2: Overview of the TransText pipeline. We obtain input latents by encoding the RGB video and the RGB-projected Alpha video (Alpha-as-RGB) via the VAE. These latents are spatially concatenated for alignment. Additionally, the reference image and its derived trimap serve as structural conditions to guide the joint generation of both RGB textures and $\alpha$ mattes. During training, in addition to the standard velocity prediction loss $\mathcal{L}_{\mathrm{mse}}$, we introduce an $\alpha$-oriented reconstruction term $\mathcal{L}_{\mathrm{rec}}$. The reconstruction loss performs one-step denoising with the predicted velocity to recover the clean latent state, then explicitly aligns the reconstructed $\alpha$ with the ground-truth matte, thereby significantly improving fine-grained transparency generation.
  • Figure 2: Ablation studies on each designed module. Bold indicates the best performance, and underlined indicates the second best.
  • Figure 2: Visualized results of different $\alpha$-oriented attention masking mechanisms. Clearly, CrossAttnM improves transparency estimation and video quality, while SelfAttnM harms RGB–$\alpha$ alignment and generation fidelity.
  • ...and 4 more figures
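
The $\alpha$-oriented reconstruction term described in the pipeline overview caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the common rectified-flow convention $x_t = (1 - t)\,x_0 + t\,\epsilon$ with velocity target $v = \epsilon - x_0$, so that one-step denoising is $\hat{x}_0 = x_t - t\,v_{\mathrm{pred}}$; it assumes the $\alpha$ latent occupies the right half of a width-concatenated latent; and it uses a mean squared error in latent space, whereas the caption does not state the exact form of $\mathcal{L}_{\mathrm{rec}}$ or the space in which it is computed. The function names are hypothetical.

```python
import numpy as np

def one_step_denoise(x_t, v_pred, t):
    """Recover the clean latent from a noisy one in a single step.

    Assumes x_t = (1 - t) * x0 + t * eps and v = eps - x0 (rectified flow).
    """
    return x_t - t * v_pred

def alpha_rec_loss(x_t, v_pred, t, x0, alpha_start):
    """MSE between reconstructed and ground-truth latents on the alpha region.

    Latents have layout (T, H, W, C); columns from `alpha_start` onward are
    assumed to hold the alpha half of the spatially concatenated latent.
    """
    x0_hat = one_step_denoise(x_t, v_pred, t)
    diff = x0_hat[..., alpha_start:, :] - x0[..., alpha_start:, :]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 16, 3))   # clean latent: RGB half + alpha half
eps = rng.standard_normal(x0.shape)       # Gaussian noise
t = 0.7
x_t = (1 - t) * x0 + t * eps              # noisy latent at time t
v_true = eps - x0                         # ground-truth velocity

# A perfect velocity prediction reconstructs x0 (up to floating-point error).
loss_perfect = alpha_rec_loss(x_t, v_true, t, x0, alpha_start=8)
# A velocity prediction off by a constant 0.1 leaves a residual of t * 0.1.
loss_biased = alpha_rec_loss(x_t, v_true + 0.1, t, x0, alpha_start=8)
```

In training, such a term would be added to the velocity prediction loss $\mathcal{L}_{\mathrm{mse}}$; the relative weighting is not given in the text shown here.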