Matte Anything: Interactive Natural Image Matting with Segment Anything Models

Jingfeng Yao, Xinggang Wang, Lang Ye, Wenyu Liu

TL;DR

Matte Anything (MatAny) addresses the labor-intensive trimap requirement in natural image matting by composing three components: SAM for segmentation, GroundingDINO for transparency detection, and a pre-trained matting model (ViTMatte) that produces alpha mattes from automatically generated pseudo-trimaps. The approach requires no task-specific training. It achieves state-of-the-art performance among trimap-free methods and remains competitive with trimap-guided methods on Composition-1k; the abstract reports improvements of 58.3% in MSE and 40.6% in SAD over previous matting methods that use simple guidance. It also shows strong zero-shot generalization to the real-world AIM-500 dataset and to the task-specific AM-2k and P3M datasets, and supports multiple-instance matting through simple user interactions. A key limitation is the computational cost of SAM, which suggests future work on lighter SAM-like models that preserve matting quality.

Abstract

Natural image matting algorithms aim to predict the transparency map (alpha matte) under trimap guidance. However, producing a trimap often requires significant labor, which limits the large-scale application of matting algorithms. To address this issue, we propose Matte Anything (MatAny), an interactive natural image matting model that can produce high-quality alpha mattes from various simple hints. The key insight of MatAny is to generate a pseudo-trimap automatically from contour and transparency predictions. In our work, we leverage vision foundation models to enhance the performance of natural image matting. Specifically, we use the Segment Anything Model (SAM) to predict high-quality contours with user interaction and an open-vocabulary detector to predict the transparency of any object. Subsequently, a pre-trained image matting model generates alpha mattes from the pseudo-trimaps. MatAny is the interactive matting algorithm with the most supported interaction methods and the best performance to date. It consists of orthogonal vision models and requires no additional training. We evaluate MatAny against several current image matting algorithms. MatAny achieves a 58.3% improvement in MSE and a 40.6% improvement in SAD compared to previous image matting methods with simple guidance, setting a new state of the art (SOTA). The source code and pre-trained models are available at https://github.com/hustvl/Matte-Anything.
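The MSE and SAD figures quoted above are error metrics between a predicted and a ground-truth alpha matte. The following is a minimal sketch of how such numbers are commonly computed, not the authors' evaluation code. It assumes alpha values in [0, 1], SAD reported in thousands, and errors restricted to an optional unknown-region mask; these conventions are common in matting benchmarks but are assumptions here. The helper names `matting_errors` and `relative_improvement` are hypothetical, and reading "improvement" as a relative reduction in error is also an assumption.

```python
import numpy as np


def matting_errors(pred, gt, unknown=None):
    """Return (SAD, MSE) between two alpha mattes with values in [0, 1].

    `unknown` is an optional boolean mask selecting the pixels to score
    (assumed convention: the trimap's unknown region). If omitted, all
    pixels are scored.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if unknown is None:
        unknown = np.ones(gt.shape, dtype=bool)
    unknown = np.asarray(unknown, dtype=bool)

    diff = (pred - gt)[unknown]
    # Sum of absolute differences, scaled by 1/1000 (assumed convention).
    sad = float(np.abs(diff).sum() / 1000.0)
    # Mean squared error over the scored pixels; 0.0 if nothing is scored.
    mse = float((diff ** 2).mean()) if diff.size else 0.0
    return sad, mse


def relative_improvement(baseline, new):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - new) / baseline
```

For example, `relative_improvement(10.0, 4.0)` gives 60.0, meaning the new error is 60% lower than the baseline.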

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparison between Matte Anything and previous methods. Matte Anything uses vision foundation models, such as Segment Anything Models (SAM) and open-vocabulary detection models (GroundingDINO), to achieve simple, interactive, high-quality natural image matting.
  • Figure 2: RGB Image and Trimap. The trimap divides an image into foreground (FG), background (BG), and an unknown region (UK). Trimap-based algorithms only need to predict alpha in the unknown region. This reduces algorithmic complexity but increases interaction cost.
  • Figure 3: Overall architecture of Matte Anything model (MatAny). Its core idea is to interactively utilize visual backbone models to generate a pseudo-trimap, which is then fed into a pre-trained trimap-based model to produce high-quality matting results. "OVD" denotes Open-Vocabulary Detectors, "SAM" denotes Segment Anything Models.
  • Figure 4: Performance on Synthetic Images.
  • Figure 5: Zero-Shot Performance on Real World Images.
  • ...and 2 more figures
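The trimap of Figure 2 and the pseudo-trimap step of Figure 3 can be illustrated with a minimal sketch. It assumes the pseudo-trimap is derived from a binary segmentation mask (such as one produced by SAM) by eroding it to get confident foreground and dilating it to bound the unknown band; the text above does not specify this procedure, so the morphological approach, the radii, the gray levels 255/128/0, and the function name `pseudo_trimap` are all assumptions. The optional `transparent` mask stands in for the detector's transparency prediction and simply marks those mask pixels as unknown, which is likewise an assumed behavior.

```python
import numpy as np

# Gray levels for foreground, unknown, and background (assumed convention).
FG, UK, BG = 255, 128, 0


def _dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square, using padded shifts."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def pseudo_trimap(mask, erode_radius=2, dilate_radius=2, transparent=None):
    """Build a trimap (uint8) from a binary mask.

    Eroded mask -> FG, dilated-minus-eroded band -> UK, everything else -> BG.
    Pixels of the mask flagged in `transparent` are forced to UK (assumption).
    """
    mask = np.asarray(mask, dtype=bool)
    dilated = _dilate(mask, dilate_radius)
    # Erosion as the complement of the dilated complement. Pixels outside
    # the image count as foreground here, so the mask does not shrink at
    # the image border.
    eroded = ~_dilate(~mask, erode_radius)

    trimap = np.full(mask.shape, BG, dtype=np.uint8)
    trimap[dilated] = UK
    trimap[eroded] = FG
    if transparent is not None:
        trimap[np.asarray(transparent, dtype=bool) & mask] = UK
    return trimap
```

Larger radii widen the unknown band, which leaves more pixels for the matting model to resolve at the cost of less guidance; the right setting depends on the image and is not taken from the source.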