Dual-Context Aggregation for Universal Image Matting

Qinglin Liu, Xiaoqian Lv, Wei Yu, Changyong Guo, Shengping Zhang

TL;DR

DCAM tackles universal image matting by combining global contour understanding with local boundary refinement in a dual-context aggregation framework. It comprises a semantic backbone to extract features, a dual-context aggregation network featuring global object and local appearance aggregators, and a matting decoder to estimate the alpha matte $\alpha$. Across five matting datasets, DCAM achieves state-of-the-art results in both automatic and interactive settings, demonstrating strong universality and robustness to diverse guidance and objects. The study underscores the value of integrating coarse global structure with fine-grained local detail for reliable matting in real-world scenarios.

Abstract

Natural image matting aims to estimate the alpha matte of the foreground from a given image. Various approaches have been explored to address this problem, including interactive matting methods that rely on guidance such as clicks or trimaps, and automatic matting methods tailored to specific objects. However, existing matting methods are designed for specific objects or guidance, neglecting the common requirement of aggregating global and local contexts in image matting. As a result, these methods often struggle to identify the foreground accurately and to generate precise boundaries, which limits their effectiveness in unforeseen scenarios. In this paper, we propose a simple and universal matting framework, named Dual-Context Aggregation Matting (DCAM), which enables robust image matting with arbitrary guidance or without guidance. Specifically, DCAM first adopts a semantic backbone network to extract low-level features and context features from the input image and guidance. Then, we introduce a dual-context aggregation network that incorporates global object aggregators and local appearance aggregators to iteratively refine the extracted context features. By performing both global contour segmentation and local boundary refinement, DCAM exhibits robustness to diverse types of guidance and objects. Finally, we adopt a matting decoder network to fuse the low-level features and the refined context features for alpha matte estimation. Experimental results on five matting datasets demonstrate that the proposed DCAM outperforms state-of-the-art matting methods in both automatic matting and interactive matting tasks, which highlights the strong universality and high performance of DCAM. The source code is available at https://github.com/Windaway/DCAM.
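
For readers new to matting, it may help to see what the alpha matte represents. The abstract does not spell out the compositing model, so the sketch below assumes the standard formulation I = αF + (1 − α)B, in which each pixel is a blend of a foreground colour F and a background colour B. The function name `composite` and the toy 1×3 image are illustrative only and are not taken from the DCAM code.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Blend foreground over background per pixel: I = alpha * F + (1 - alpha) * B."""
    # Clamp to the valid opacity range and add a channel axis so the
    # (H, W) matte broadcasts over the (H, W, 3) colour images.
    alpha = np.clip(alpha, 0.0, 1.0)[..., None]
    return alpha * foreground + (1.0 - alpha) * background

# Toy 1x3 image: pure background, a half-transparent boundary pixel, pure foreground.
alpha = np.array([[0.0, 0.5, 1.0]])
foreground = np.ones((1, 3, 3))    # white foreground
background = np.zeros((1, 3, 3))   # black background
image = composite(alpha, foreground, background)
```

Matting is the inverse problem: given only the composited image (and optionally guidance such as clicks or a trimap), recover α. The fractional values at object boundaries are what make precise local refinement important.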

Paper Structure

This paper contains 21 sections, 13 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Overview of the Dual-Context Aggregation Matting (DCAM) framework. A semantic backbone network first extracts low-level features and context features from the input image and guidance. Then, a dual-context aggregation network iteratively performs global object aggregation and local appearance aggregation to refine the extracted context features. Finally, a matting decoder network fuses the low-level features with the refined context features to predict the alpha matte.
  • Figure 2: Structures of the global object aggregator and the local appearance aggregator. The global object aggregator utilizes semantic-object attention to perform global contour refinement, while the local appearance aggregator adopts a hybrid transformer structure that utilizes both low-frequency and high-frequency context to perform local segmentation refinement.
  • Figure 3: Qualitative results on the HIM-100K dataset. The red dots denote the click guidance.
  • Figure 4: Qualitative results on the Adobe Composition-1K dataset.
  • Figure 5: Qualitative results on the Distinctions-646 dataset. All methods are trained on the Adobe Composition-1K dataset.
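
The three-stage data flow described in the Figure 1 caption can be sketched roughly as follows. This is a shape-level illustration only, not the authors' implementation: every function body is a hand-written NumPy stand-in for a learned network, and the function names, the 1/4-resolution context features, the all-zero map used when no guidance is given, the global-then-local ordering within each iteration, and the default of two iterations are all assumptions made for the example.

```python
import numpy as np

def semantic_backbone(image, guidance):
    """Stand-in encoder: full-resolution low-level features plus coarse context features."""
    x = np.concatenate([image, guidance[..., None]], axis=-1)  # (H, W, 4)
    low_level = x
    context = x[::4, ::4]  # crude 4x subsampling in place of a learned encoder
    return low_level, context

def global_object_aggregator(context):
    """Stand-in for global aggregation: mix every position with the image-wide mean."""
    global_mean = context.mean(axis=(0, 1), keepdims=True)
    return 0.5 * context + 0.5 * global_mean

def local_appearance_aggregator(context):
    """Stand-in for local aggregation: average each position with its four neighbours."""
    p = np.pad(context, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0

def matting_decoder(low_level, context):
    """Stand-in decoder: upsample context, fuse with low-level features, squash to (0, 1)."""
    h, w = low_level.shape[:2]
    up = np.repeat(np.repeat(context, 4, axis=0), 4, axis=1)[:h, :w]
    fused = (low_level + up).mean(axis=-1)
    return 1.0 / (1.0 + np.exp(-fused))  # sigmoid keeps alpha in a valid range

def dcam_like_forward(image, guidance=None, num_iterations=2):
    """Backbone -> iterative dual-context aggregation -> decoder."""
    if guidance is None:
        # Automatic setting: assume an empty guidance map (hypothetical encoding).
        guidance = np.zeros(image.shape[:2])
    low_level, context = semantic_backbone(image, guidance)
    for _ in range(num_iterations):
        context = global_object_aggregator(context)
        context = local_appearance_aggregator(context)
    return matting_decoder(low_level, context)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
alpha = dcam_like_forward(image)  # (32, 32) matte with values in (0, 1)
```

The point of the sketch is the control flow: the same context features pass repeatedly through a global step and a local step, while the low-level features bypass the aggregation loop and are only fused back in at the decoder. The actual aggregators (semantic-object attention and the hybrid transformer in Figure 2) are considerably more involved than these averaging placeholders.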