OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan

TL;DR

OmniAlpha is a unified, multi-task sequence-to-sequence framework for RGBA (RGB plus alpha) image generation and editing. It combines an end-to-end alpha-aware VAE initialized from RGB priors with a Diffusion Transformer (DiT) backbone that uses MSRoPE-BiL to process multiple RGBA layers concurrently, and it is trained jointly on 21 tasks using the AlphaLayers dataset. The authors report that it outperforms specialized baselines across matting, layer decomposition, and layer-conditioned generation benchmarks, including an 84.8% relative reduction in SAD for mask-free matting on AIM-500 and over 90% of human preferences in layer-conditioned completion. The results suggest that a single alpha-aware model can learn a shared RGBA representation that supports layer-aware generation for VFX, design, and compositing workflows.

Abstract

Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. We jointly train OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, and extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
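
The headline matting number in the abstract is a relative reduction in SAD. As a rough illustration of how such a figure is computed, the sketch below assumes that SAD is the sum of absolute differences between predicted and ground-truth alpha values in [0, 1], divided by 1000 (a reporting convention common in matting work, assumed here rather than taken from the paper). The function names `sad` and `relative_reduction` and all numeric values are hypothetical; the toy scores are chosen only so that the ratio comes out to 0.848 and are not the paper's reported results.

```python
def sad(pred_alpha, gt_alpha, scale=1000.0):
    """Sum of absolute differences between two alpha mattes.

    Both mattes are lists of rows with values in [0, 1].
    The division by `scale` (1000 by default) is an assumed convention.
    """
    if len(pred_alpha) != len(gt_alpha):
        raise ValueError("mattes must have the same number of rows")
    total = 0.0
    for pred_row, gt_row in zip(pred_alpha, gt_alpha):
        if len(pred_row) != len(gt_row):
            raise ValueError("rows must have the same length")
        total += sum(abs(p - g) for p, g in zip(pred_row, gt_row))
    return total / scale


def relative_reduction(baseline_score, new_score):
    """Fraction by which an error metric drops relative to a baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline_score - new_score) / baseline_score


# Toy 2x2 mattes: two pixels are off by 0.5 each.
gt = [[0.0, 1.0], [1.0, 0.0]]
pred = [[0.0, 0.5], [1.0, 0.5]]
unscaled = sad(pred, gt, scale=1.0)

# Hypothetical scores, not taken from the paper.
reduction = relative_reduction(10.0, 1.52)
print(f"SAD (unscaled): {unscaled:.3f}, relative reduction: {reduction:.1%}")
```

Because the metric is an error, lower is better, so a relative reduction close to 1 means the new model removes most of the baseline's error.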

Paper Structure

This paper contains 30 sections, 15 equations, 79 figures, 6 tables.

Figures (79)

  • Figure 1: Demonstrating OmniAlpha's versatility across a range of RGBA tasks. Our unified model handles: text-to-image generation (Row 1); layer decomposition and mask-conditioned matting (Row 2); referring and automatic matting (Row 3); and layer-conditioned completion (Row 4), along with other tasks described in the main text.
  • Figure 2: Overview of the OmniAlpha Diffusion Transformer architecture. Conditioned on a task instruction and $n$ RGBA images, the model simultaneously denoises $m$ target images. We employ 3D MSRoPE for positional encoding, which treats the layer axis as a z-index to effectively process multiple layers concurrently.
  • Figure 3: Dataset preparation pipeline. We construct the multi-layer dataset using Qwen3-VL [qwen3technicalreport] as the core vision-language model, with Qwen-Image-Edit [wu2025qwenimagetechnicalreport] and ObjectClear [zhao2025objectclearcompleteobjectremoval] as domain-specific expert models.
  • Figure 4: Mask generation pipeline. Starting from the foreground image, we derive a tuple of masks in several forms.
  • Figure 5: Isolate a clear foreground with defined edges and accurate transparency.
  • ...and 74 more figures
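
The Figure 2 caption describes a positional encoding that treats the layer axis as a z-index, and the abstract calls that axis bi-directionally extendable. One way to picture this is sketched below, under a loudly stated assumption: target layers take z = 0, 1, ..., m-1 and input (condition) layers take z = -1, -2, ..., -n, so either group can grow without colliding with the other. The paper's actual index assignment is not given in this excerpt, so this mapping is a guess, and the helper names `layer_axis_indices` and `token_positions` are hypothetical.

```python
def layer_axis_indices(num_inputs, num_targets):
    """Assign a z-index to every layer (assumed scheme, not the paper's).

    Targets extend in the non-negative direction, inputs in the negative one.
    """
    return {
        "inputs": [-(i + 1) for i in range(num_inputs)],
        "targets": list(range(num_targets)),
    }


def token_positions(z_index, height, width):
    """Return (z, y, x) position triples for one layer's latent token grid."""
    return [(z_index, y, x) for y in range(height) for x in range(width)]


# n = 2 conditioning layers and m = 3 target layers, each a 2x2 token grid.
indices = layer_axis_indices(num_inputs=2, num_targets=3)
sequence = []
for z in indices["inputs"] + indices["targets"]:
    sequence.extend(token_positions(z, height=2, width=2))

print(indices)
print(len(sequence), "tokens; first:", sequence[0], "last:", sequence[-1])
```

In a rotary scheme, each of the three coordinates would then drive the rotation of its own slice of the attention head dimension. That step is omitted here, since its details depend on the backbone.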