DFIMat: Decoupled Flexible Interactive Matting in Multi-Person Scenarios

Siyi Jiao, Wenzheng Zeng, Changxin Gao, Nong Sang

TL;DR

DFIMat tackles practical interactive portrait matting in multi-person scenes by decoupling localization from refinement. It introduces ISCN for flexible multi-type inputs and instance localization, MRN for boundary-focused refinement, and a contrastive reasoning module to leverage cross-round feedback. A diffusion-based synthetic data pipeline yields SMPMat, a large-scale 40k-image dataset that enhances training realism. Empirical results on SMPMat and HIM2K show DFIMat outperforms state-of-the-art methods, with a lightweight variant achieving strong accuracy, and the work provides actionable guidance on input type usage and data synthesis for scalable matting research.

Abstract

Interactive portrait matting refers to extracting the soft portrait from a given image that best matches the user's intent, as expressed through their inputs. Existing methods often underperform in complex scenarios, mainly due to three factors. (1) Most works apply a tightly coupled network that directly predicts matting results, which lacks interpretability and leads to inadequate modeling. (2) Existing works are limited to a single type of user input, which is ineffective for understanding intent and inefficient for the user to operate. (3) The multi-round nature of interaction, which is crucial in practice, has been under-explored. To alleviate these limitations, we propose DFIMat, a decoupled framework that enables flexible interactive matting. Specifically, we first decouple the task into two sub-tasks: localizing target instances by understanding scene semantics and the flexible user inputs, and performing refinement for instance-level matting. We observe a clear performance gain from decoupling, as it makes the sub-tasks easier to learn, and the flexible multi-type input further enhances both effectiveness and efficiency. DFIMat also accounts for the multi-round interaction property: a contrastive reasoning module is designed to enhance cross-round refinement. Another limitation of the multi-person matting task is the lack of training data. We address this by introducing a new synthetic data generation pipeline that can generate much more realistic samples than prior work. A new large-scale dataset, SMPMat, is subsequently established. Experiments show that DFIMat significantly outperforms existing methods. With it, we also investigate the roles of different input types, providing valuable principles for users. Our code and dataset can be found at https://github.com/JiaoSiyi/DFIMat.
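
The decoupled, multi-round control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `BoxInput`, `ClickInput`, `localize`, `refine`, and `interactive_matting` are hypothetical, the input types are assumed for illustration, and both stages are hand-written rules standing in for the learned networks (ISCN and MRN). The sketch ignores pixel content and does not model the contrastive reasoning module; it only shows how localization and refinement can be kept separate while user inputs accumulate across rounds.

```python
from dataclasses import dataclass
from typing import List, Union

Mask = List[List[float]]  # values in [0, 1], indexed as mask[row][col]


@dataclass
class BoxInput:
    """Hypothetical box prompt with inclusive pixel bounds."""
    top: int
    left: int
    bottom: int
    right: int


@dataclass
class ClickInput:
    """Hypothetical click prompt; positive=False marks a pixel to exclude."""
    row: int
    col: int
    positive: bool = True


UserInput = Union[BoxInput, ClickInput]


def localize(height: int, width: int, inputs: List[UserInput]) -> Mask:
    """Stage 1 stand-in: build a coarse binary mask from mixed input types.

    Assumption: a real localization stage is a learned network that also
    reads the image; this rule only rasterizes the prompts.
    """
    mask = [[0.0] * width for _ in range(height)]
    # Boxes first, so that clicks can override pixels inside a box.
    for item in inputs:
        if isinstance(item, BoxInput):
            for r in range(max(0, item.top), min(height, item.bottom + 1)):
                for c in range(max(0, item.left), min(width, item.right + 1)):
                    mask[r][c] = 1.0
    for item in inputs:
        if isinstance(item, ClickInput):
            if 0 <= item.row < height and 0 <= item.col < width:
                mask[item.row][item.col] = 1.0 if item.positive else 0.0
    return mask


def refine(mask: Mask) -> Mask:
    """Stage 2 stand-in: soften the coarse mask with a 3x3 mean filter.

    Assumption: a real refinement stage predicts alpha from image detail;
    the mean filter only mimics a soft boundary.
    """
    h, w = len(mask), len(mask[0])
    alpha = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            alpha[r][c] = sum(vals) / len(vals)
    return alpha


def interactive_matting(
    height: int, width: int, rounds: List[List[UserInput]]
) -> Mask:
    """Run localization then refinement once per round of user feedback."""
    history: List[UserInput] = []
    alpha = [[0.0] * width for _ in range(height)]
    for round_inputs in rounds:
        history.extend(round_inputs)  # later rounds see all earlier inputs
        coarse = localize(height, width, history)
        alpha = refine(coarse)
    return alpha
```

For example, `interactive_matting(5, 5, [[BoxInput(1, 1, 3, 3)], [ClickInput(2, 2, positive=False)]])` selects a region with a box in the first round and corrects it with a negative click in the second. Because each round mixes input types freely, the same loop covers single-type and combined-type interaction.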

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of different methods. DFIMat supports (1) multi-type user inputs, (2) any combination of different types of input at each time, and (3) multi-round iteration.
  • Figure 2: The overall framework of DFIMat, which consists of two components: (a) Interactive semantic capture network (ISCN), and (b) Matting refinement network (MRN).
  • Figure 3: Visual comparison of synthetic datasets.
  • Figure 4: The synthetic data generation pipeline.
  • Figure 5: Performance comparison under different rounds of interaction on SMPMat.
  • ...and 2 more figures