Zippo: Zipping Color and Transparency Distributions into a Single Diffusion Model

Kangyang Xie, Binbin Yang, Hao Chen, Meng Wang, Cheng Zou, Hui Xue, Ming Yang, Chunhua Shen

TL;DR

This work presents Zippo, a unified framework that zips the color and transparency distributions into a single diffusion model by expanding the diffusion latent into a joint representation of RGB images and alpha mattes. It also proposes a modality-aware noise reassignment strategy that enables Zippo to jointly generate RGB images and their corresponding alpha mattes under text guidance.

Abstract

Beyond the strength of text-to-image diffusion models in generating high-quality images, recent studies have attempted to adapt their learned semantic knowledge to visual perception tasks. In this work, instead of translating a generative diffusion model into a visual perception model, we explore how to retain the generative ability alongside the perceptive adaptation. To accomplish this, we present Zippo, a unified framework that zips the color and transparency distributions into a single diffusion model by expanding the diffusion latent into a joint representation of RGB images and alpha mattes. By alternately selecting one modality as the condition and applying the diffusion process to the other modality, Zippo can generate RGB images from alpha mattes and predict transparency from input images. In addition to single-modality prediction, we propose a modality-aware noise reassignment strategy that enables Zippo to jointly generate RGB images and their corresponding alpha mattes under text guidance. Our experiments showcase Zippo's ability to perform efficient text-conditioned transparent image generation and present plausible results for Matte-to-RGB and RGB-to-Matte translation.
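The conditioning scheme described in the abstract (one modality kept as the condition, the diffusion process applied to the other) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the task names, the `make_joint_latent` helper, the flattened list representation of latents, and the single `alpha_bar` noise level. The noising step is the standard DDPM forward process, and the sketch does not attempt to reproduce the paper's modality-aware noise reassignment strategy, whose details are not given in this summary.

```python
import math
import random

# Hypothetical task table: (noise the RGB half?, noise the alpha half?)
TASKS = {
    "rgb_to_matte": (False, True),   # RGB latent is the clean condition
    "matte_to_rgb": (True, False),   # alpha latent is the clean condition
    "joint": (True, True),           # both modalities are generated
}

def diffuse(latent, alpha_bar, rng):
    """Standard DDPM forward step: sqrt(a) * z0 + sqrt(1 - a) * eps, elementwise."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * v + mix * rng.gauss(0.0, 1.0) for v in latent]

def make_joint_latent(z_rgb, z_alpha, task, alpha_bar, rng):
    """Concatenate both latents, noising only the modality to be generated."""
    noise_rgb, noise_alpha = TASKS[task]
    rgb_part = diffuse(z_rgb, alpha_bar, rng) if noise_rgb else list(z_rgb)
    alpha_part = diffuse(z_alpha, alpha_bar, rng) if noise_alpha else list(z_alpha)
    return rgb_part + alpha_part

rng = random.Random(0)
z_rgb = [0.5, -1.0, 0.25, 2.0]   # toy flattened RGB latent
z_alpha = [1.0, 0.0, 0.0, 1.0]   # toy flattened alpha latent
joint = make_joint_latent(z_rgb, z_alpha, "rgb_to_matte", alpha_bar=0.3, rng=rng)
```

In this toy setup, the RGB half of `joint` is left untouched for RGB-to-Matte prediction, while the alpha half is the part a denoiser would be asked to recover.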

Paper Structure

This paper contains 19 sections, 7 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: A comparison between our proposed Zippo and a separate text-to-image diffusion model and image matting model. (a) A text-to-image diffusion model generates an RGB image according to the input text prompt; (b) an image matting model predicts an alpha matte from the input RGB image; (c) Zippo models the joint distribution of RGB and transparency information and thus acts as a multi-task learner. Specifically, besides RGB-to-Matte and Matte-to-RGB translation, Zippo can generate an RGB image and its alpha matte simultaneously, and the pair can be composited into a transparent image.
  • Figure 2: An illustration of the workflow of our proposed Zippo. Given an RGB image $x$ and its corresponding alpha matte $a$, we turn the pre-trained generative diffusion model into a joint distribution learner of color and transparency information, which can perform perceptive RGB-to-Matte estimation, Matte-to-RGB generation, and joint generation of the paired image and alpha matte. For each task, we learn a distribution identifier for task routing. Taking joint generation as an example, we sample two standard Gaussian noises $z_T^*$ and $z_T$ and concatenate them together as the joint latent $z_T^{joint}$. Zippo then iteratively denoises the noisy $z_T^{joint}$ to produce the clean joint latent $z_0^{joint}$. Finally, we split $z_0^{joint}$ into $z_*^x$ and $z_*^a$ and decode them into the RGB image $x^*$ and alpha matte $a^*$.
  • Figure 3: RGB images and alpha mattes share the same VAE encoder and decoder. A VAE trained for encoding RGB images is also sufficient for transparency reconstruction.
  • Figure 4: Results of joint generation on VITON. The first two columns are generated simultaneously by Zippo. To evaluate the alpha mask, we present the resulting transparent image composed from the first two columns.
  • Figure 5: Results of joint generation on AM2K. The first two columns are generated simultaneously by Zippo. To evaluate the alpha mask, we present the resulting transparent image composed from the first two columns.
  • ...and 15 more figures
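
The captions for Figures 4 and 5 mention a transparent image composed from a generated RGB image and its alpha matte. A minimal sketch of that composition step is shown below, assuming pixel values in [0, 1] and images stored as flat lists of pixels. The helper names `compose_rgba` and `over_background` are hypothetical and not taken from the paper; the blend is the standard alpha compositing formula, alpha * foreground + (1 - alpha) * background.

```python
def compose_rgba(rgb, alpha):
    """Attach an alpha matte to an RGB image, pixel by pixel (values in [0, 1])."""
    if len(rgb) != len(alpha):
        raise ValueError("RGB image and alpha matte must have the same number of pixels")
    return [(r, g, b, a) for (r, g, b), a in zip(rgb, alpha)]

def over_background(rgba, background):
    """Preview a transparent image on a solid background colour.

    Uses the standard compositing rule: a * foreground + (1 - a) * background.
    """
    return [
        tuple(a * fg + (1.0 - a) * bg for fg, bg in zip((r, g, b), background))
        for (r, g, b, a) in rgba
    ]

# Toy two-pixel example: an opaque red pixel and a half-transparent blue pixel.
rgb = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
alpha = [1.0, 0.5]
rgba = compose_rgba(rgb, alpha)
preview = over_background(rgba, (1.0, 1.0, 1.0))  # white background
```

Previewing over a solid background is one simple way to inspect a matte visually: errors in the alpha channel show up as halos or missing regions in the composite.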