MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou

TL;DR

MP-Mat tackles multi-instance human matting by introducing two multiplane representations: scene geometry-level multiplanes (SG-MP) and instance-level multiplanes (Inst-MP). SG-MP provides a depth-aware, feature-level plane decomposition, while Inst-MP encodes each instance and the background with both color and alpha, enforcing the rendering relation $I = \sum_{i=0}^{S} c_i \alpha_i$ to support efficient instance-level editing. The framework uses a Transformer-based Instance Query and an uncertainty-guided refinement module to produce accurate alpha mattes and colors, and is trained end-to-end with a combined detection and matting loss. Empirically, MP-Mat achieves state-of-the-art instance matting results on HIM-100K and SMPMat and shows strong zero-shot performance on instance editing tasks, highlighting its practical value for image editing and complex scene understanding.
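The rendering relation above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `composite` and `remove_instance`, the choice of layer 0 as the background, the array shapes, and the assumption that the alphas sum to 1 at every pixel are all illustrative assumptions. It only shows why storing a color and an alpha per layer makes an edit such as instance removal a cheap re-composition.

```python
import numpy as np

def composite(colors, alphas):
    """Render I = sum_i c_i * alpha_i over all layers.

    colors: (S+1, H, W, 3) per-layer RGB; alphas: (S+1, H, W) per-layer alpha.
    Treating layer 0 as the background is an assumption of this sketch.
    """
    colors = np.asarray(colors, dtype=np.float64)
    alphas = np.asarray(alphas, dtype=np.float64)
    # Broadcast each alpha over the RGB channels, then sum across layers.
    return (colors * alphas[..., None]).sum(axis=0)

def remove_instance(colors, alphas, k):
    """Hypothetical edit: drop layer k and hand its alpha to the background."""
    alphas = np.array(alphas, dtype=np.float64, copy=True)
    alphas[0] += alphas[k]
    alphas[k] = 0.0
    return composite(colors, alphas)

# Toy 1x2 image: a blue background plus a red and a green instance.
colors = np.zeros((3, 1, 2, 3))
colors[0] = [0.0, 0.0, 1.0]  # background color, assumed known everywhere
colors[1] = [1.0, 0.0, 0.0]
colors[2] = [0.0, 1.0, 0.0]
alphas = np.array([
    [[0.0, 0.5]],  # background
    [[1.0, 0.0]],  # red instance
    [[0.0, 0.5]],  # green instance
])

image = composite(colors, alphas)
edited = remove_instance(colors, alphas, 2)
```

In this toy case the second pixel is an even blend of background and green instance before the edit and pure background afterwards, with no network inference needed once the layers exist.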

Abstract

Human instance matting aims to estimate an alpha matte for each human instance in an image. The task is challenging because it easily fails in complex cases that require disentangling mingled pixels belonging to multiple instances along hairy and thin boundary structures. In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level. Specifically, we first build feature-level multiplane representations to split the scene into multiple planes based on depth differences. This approach makes the scene representation 3D-aware and can serve as an effective cue for separating instances at different 3D positions, thereby improving interpretability and boundary handling, especially in occluded areas. Then, we introduce another multiplane representation that splits the scene from an instance-level perspective and represents each instance with both matte and color. We also treat the background as a special instance, which is often overlooked by existing methods. Such an instance-level representation facilitates both foreground and background content awareness, and is useful for other downstream tasks like image editing. Once built, the representation can be reused to realize controllable instance-level image editing with high efficiency. Extensive experiments validate the clear advantage of MP-Mat in the matting task. We also demonstrate its superiority in image editing tasks, an area under-explored by existing matting-focused methods, where our approach under zero-shot inference even outperforms trained specialized image editing techniques by large margins. Code is open-sourced at https://github.com/JiaoSiyi/MPMat.git.
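To make the idea of splitting a scene into planes by depth more concrete, here is a minimal NumPy sketch. The abstract does not specify how the feature-level planes are built, so everything below is an assumption for illustration only: the function name `split_by_depth`, the uniform depth bins, and the hard one-plane-per-pixel assignment.

```python
import numpy as np

def split_by_depth(features, depth, num_planes):
    """Assign each pixel's feature vector to one of num_planes depth bins.

    features: (H, W, C) feature map; depth: (H, W) depth map.
    Returns an array of shape (num_planes, H, W, C) in which every pixel is
    non-zero in exactly one plane. Uniform bins are an assumption.
    """
    features = np.asarray(features, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    # Evenly spaced bin edges between the nearest and farthest depth.
    edges = np.linspace(depth.min(), depth.max(), num_planes + 1)
    # Using only interior edges keeps indices within [0, num_planes - 1].
    plane_idx = np.digitize(depth, edges[1:-1])
    # One boolean mask per plane, shape (num_planes, H, W).
    masks = plane_idx[None] == np.arange(num_planes)[:, None, None]
    return features[None] * masks[..., None]

# Toy 1x4 scene: two near pixels and two far pixels.
depth = np.array([[0.0, 1.0, 4.0, 5.0]])
features = np.ones((1, 4, 2))
planes = split_by_depth(features, depth, num_planes=2)
```

Summing the planes along the first axis recovers the original feature map, so the split only reorganizes the features by depth. Instances standing at different depths land in different planes, which is the intuition behind using depth as a cue for separating them.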

Paper Structure

This paper contains 41 sections, 17 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: The proposed MP-Mat can perform well in both instance matting and editing tasks, outperforming existing state-of-the-art specialist models designed for individual tasks. Distinguished areas are highlighted with bounding boxes, where MP-Mat preserves finer details and better retains regions that should remain semantically unchanged.
  • Figure 2: The overall framework of the proposed MP-Mat. $\oplus$ indicates concatenation operation.
  • Figure 3: Qualitative comparisons. Distinguished areas are highlighted with bounding boxes.
  • Figure 4: Qualitative comparisons for editing tasks. Yellow boxes highlight the distinguished areas.
  • Figure 5: Examples of the sample pairs within the proposed ORHuman dataset, where we show the pairs of images before and after occlusion reordering (indicated by yellow arrows).
  • ...and 3 more figures