Matting by Generation

Zhixiang Wang; Baiang Li; Jian Wang; Yu-Lun Liu; Jinwei Gu; Yung-Yu Chuang; Shin'ichi Satoh

TL;DR

Image matting estimates the alpha matte α from the compositing equation C = αF + (1−α)B, which is highly ill-posed because the foreground F, the background B, and α are all unknown for each observed pixel C. This work reframes matting as conditional generation with a latent diffusion prior: a pre-trained diffusion model regularizes the problem and yields high-resolution, detail-rich mattes. It supports both guidance-free and guidance-based matting, including text and spatial cues, and reaches high resolution through patch-based inference guided by low-resolution mattes. Experiments on three real-world benchmarks show quantitative improvements and visually faithful boundaries. Diffusion-based inference is slower than regression, but the approach offers a flexible and scalable matting paradigm for editing and compositing.
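The compositing equation can be made concrete with a small NumPy sketch. It is illustrative only and not taken from the paper: the helper name `composite` and the toy pixel values are assumptions. It shows two different (α, F, B) triplets producing the same observed color C, which is why the problem is ill-posed.

```python
import numpy as np

def composite(alpha, fg, bg):
    """Blend per pixel: C = alpha * F + (1 - alpha) * B (hypothetical helper)."""
    # Add a trailing axis so one alpha value applies to all color channels.
    alpha = np.asarray(alpha, dtype=np.float64)[..., None]
    fg = np.asarray(fg, dtype=np.float64)
    bg = np.asarray(bg, dtype=np.float64)
    return alpha * fg + (1.0 - alpha) * bg

black = np.zeros((1, 3))
# Explanation 1: half-transparent white foreground over black.
c1 = composite([0.5], np.ones((1, 3)), black)
# Explanation 2: fully opaque mid-gray foreground over black.
c2 = composite([1.0], np.full((1, 3), 0.5), black)
# Both explanations yield the same observed pixel color.
same_color = bool(np.allclose(c1, c2))
```

Because the observation alone cannot separate these explanations, a matting method needs a prior or extra guidance to choose among them.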

Abstract

This paper introduces an innovative approach for image matting that redefines the traditional regression-based task as a generative modeling challenge. Our method harnesses the capabilities of latent diffusion models, enriched with extensive pre-trained knowledge, to regularize the matting process. We present novel architectural innovations that empower our model to produce mattes with superior resolution and detail. The proposed method is versatile and can perform both guidance-free and guidance-based image matting, accommodating a variety of additional cues. Our comprehensive evaluation across three benchmark datasets demonstrates the superior performance of our approach, both quantitatively and qualitatively. The results not only reflect our method's robust effectiveness but also highlight its ability to generate visually compelling mattes that approach photorealistic quality. The project page for this paper is available at https://lightchaserx.github.io/matting-by-generation/

Paper Structure (29 sections, 8 equations, 11 figures, 3 tables)

Introduction
List of Contributions
Related Work
Guidance-based Matting.
Guidance-free Matting.
Diffusion Models.
Method
Generative Formulation
Conditional Generation with a Single Input Image
HR Inference with LR Guidance
Patch Sampling
Patch-Based Inference
Guidance Mechanism
Additional Guidance
Text Guidance
...and 14 more sections

Figures (11)

Figure 1: Imperfect human annotation. The training data are usually either blurry or lacking in some details, so a regression-based model tends to overfit the imperfect ground truth.
Figure 2: Method. (a) The low-resolution inference path can be used alone if we do not need very high-quality mattes or have a limited computational budget. The input is the low-resolution latent feature $\mathbf{z}^{(\mathbf{x}\downarrow)}$ of the down-sampled image $\mathbf{x}\downarrow$ and the sampled noise $\boldsymbol{\epsilon}_t$. If there is spatial guidance $c_\mathcal{S}$ present, we will combine it with the sampled noise as the noisy sample. If a text prompt $c_\mathcal{T}$ is provided, we will deliver it to the U-Net. The output of this path is the denoised latent feature $\hat{\mathbf{z}}_0$. This path requires a few steps $T^\prime\sim10$. (c) We run this step multiple times with different random seeds to get $L$ predictions in the pixel space. With them, we estimate the uncertainty map $\mathcal{U}$, and the set of candidate regions $\mathcal{B}=\{b_i\}_1^B$. (b) The high-resolution path. We first add the up-sampled latent feature to the sampled noise. Then, we split the high-resolution latent input and noise into overlapped patches according to $\mathcal{B}$. These patches are respectively fed into the diffusion denoising network. Finally, we merge all denoised patches to get a collage. We perform "split" and "collage" during every denoising step $t\in\{1,\ldots,T\}$. We will use a specific text prompt: "enhance details" if there is a text prompt used in the LR path.
Figure 3: Visual results of trimap-free matting on PPM-100 [Ke-2022-MODNet]. Our method achieves more accurate matting results, especially around thin and detailed structures, compared to prior work. We extracted the foreground using the technique proposed by Germer et al. [germer2021fast] and composited it onto a new background sampled from a public background database [lin2021real].
Figure 4: Use of guidance. With various guidance, we can reduce ambiguity.
Figure 5: Randomness test. We use 5 different random seeds to test the model on selected images. As the number of diffusion steps increases, both the mean and the standard deviation of the SAD error decrease.
...and 6 more figures
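The uncertainty estimate and the "split" and "collage" steps described in the Figure 2 caption can be sketched in NumPy as follows. The caption does not say how the uncertainty map is computed or how overlapping patches are merged, so this sketch assumes a per-pixel standard deviation across the L predictions and plain averaging over overlaps. It also uses a fixed grid in place of the candidate regions $\mathcal{B}$. All function names are hypothetical.

```python
import numpy as np

def uncertainty_map(preds):
    """Per-pixel std across L predictions of shape (L, H, W). Assumed estimator."""
    return np.std(np.asarray(preds, dtype=np.float64), axis=0)

def split_boxes(h, w, size, stride):
    """Overlapping boxes (y0, x0, y1, x1) on a fixed grid covering an h x w map."""
    def starts(n):
        s = list(range(0, max(n - size, 0) + 1, stride))
        if s[-1] + size < n:  # add a final box so the border is covered
            s.append(n - size)
        return s
    return [(y, x, min(y + size, h), min(x + size, w))
            for y in starts(h) for x in starts(w)]

def collage(patches, boxes, h, w):
    """Merge patches by averaging wherever boxes overlap. Assumed merge rule."""
    acc = np.zeros((h, w))
    cnt = np.zeros((h, w))
    for patch, (y0, x0, y1, x1) in zip(patches, boxes):
        acc[y0:y1, x0:x1] += patch
        cnt[y0:y1, x0:x1] += 1
    return acc / np.maximum(cnt, 1)

# Round trip with an identity "denoiser": split, then collage back.
rng = np.random.default_rng(0)
x = rng.random((10, 11))
boxes = split_boxes(10, 11, size=4, stride=3)
patches = [x[y0:y1, x0:x1] for (y0, x0, y1, x1) in boxes]
restored = collage(patches, boxes, 10, 11)
```

In the method itself, each patch would pass through the diffusion denoising network between the split and the collage, at every denoising step. The identity round trip here only checks that the bookkeeping preserves the input.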
