
MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation

Minh-Quan Le, Tam V. Nguyen, Trung-Nghia Le, Thanh-Toan Do, Minh N. Do, Minh-Triet Tran

TL;DR

MaskDiff tackles few-shot instance segmentation by modeling the conditional distribution of binary masks with a diffusion probabilistic approach. It uses a UNet-based denoiser with conditioning on image regions and $K$-shot information, augmented by classifier-free guided sampling to inject category signals. The paper provides full forward/reverse diffusion derivations, a variational upper-bound loss with a simple training objective, and thorough ablations plus strong COCO-based results showing improved accuracy and stability over prior FSOD/FSIS methods. The approach preserves spatial details by using object-region conditioning rather than pooling, offering competitive performance across base and novel classes. Overall, MaskDiff demonstrates that diffusion-based conditional mask modeling yields robust, high-precision segmentation in data-scarce regimes with practical gains for FSIS tasks.

Abstract

Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which aims to segment object instances in a query image given only a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g., the mean of the $K$-shot examples) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying conditional distribution of a binary mask, conditioned on an object region and $K$-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise to populate low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff.
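
As a rough illustration of the two mechanisms named in the abstract, the Python sketch below shows (1) the closed-form forward noising step of a standard denoising diffusion model applied to a binary mask and (2) the usual classifier-free guidance rule that mixes conditional and unconditional noise predictions. The linear noise schedule, the rescaling of masks to {-1, +1}, the guidance weight `w`, and every function name here are assumptions made for illustration; they are not taken from the MaskDiff code, whose exact parameterization may differ.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_t), one value per timestep."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_mask(mask, t, abar, rng):
    """Sample y_t from q(y_t | y_0) in closed form for a flattened binary mask.

    `mask` holds 0/1 values and is rescaled to {-1, +1} before noising
    (an assumed preprocessing choice). `t` is a 0-indexed timestep.
    Returns the noisy mask and the Gaussian noise that was added.
    """
    y0 = [2.0 * m - 1.0 for m in mask]
    eps = [rng.gauss(0.0, 1.0) for _ in y0]
    signal_scale = math.sqrt(abar[t])
    noise_scale = math.sqrt(1.0 - abar[t])
    y_t = [signal_scale * y + noise_scale * e for y, e in zip(y0, eps)]
    return y_t, eps


def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bars(linear_betas(1000))
    y_t, eps = noise_mask([0, 1, 1, 0], t=500, abar=abar, rng=rng)
    print(y_t)
    # In a real sampler, both predictions would come from the same denoiser,
    # evaluated with and without the category condition.
    print(guided_eps([1.0, 2.0], [0.5, 1.0], w=1.0))
```

With `w = 0` the guided prediction reduces to the conditional one; larger `w` pushes samples further toward the conditioning signal, usually at some cost in diversity.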

Paper Structure

This paper contains 9 sections, 25 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Denoising architecture of the diffusion probabilistic model for mask distribution modeling. The architecture is based on UNet, with modifications that include adding Residual Blocks (ResBlock) and Attention Blocks (Att). For the conditioning module, we simply concatenate the three components $\mathbf{y}_t$, $\mathbf{x}$, and $\mathbf{k}$ and feed them into the network to generate the less noisy version $\mathbf{y}_{t-1}$.
  • Figure 2: Inference examples. Successful cases (top four rows) and failure cases (bottom two rows) when training and running inference in the one-shot setting on the COCO novel classes. Failures include wrong classification (5th row), missed detection, and imprecise instance segmentation (bottom row).
  • Figure 3: Qualitative results of the diffusion model's inference procedure with and without guided sampling. Guided sampling generates more organized and semantically meaningful content.
  • Figure 4: MaskDiff preserves spatial information, especially fine detail. We compare the AP of MaskDiff with state-of-the-art methods on $K = 1, 5, 10$-shot instance segmentation at different IoU thresholds. MaskDiff outperforms the other methods by large margins, especially at high IoU thresholds, which indicates that it segments objects more precisely.
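
Two ideas from the captions above can be sketched in a few lines of Python: the channel-wise concatenation of $\mathbf{y}_t$, $\mathbf{x}$, and $\mathbf{k}$ described for Figure 1, and the mask IoU test that underlies AP at different IoU thresholds in Figure 4. This is a minimal sketch under stated assumptions: tensors are represented as plain nested lists (a list of channels, each an H x W grid), the function names are hypothetical, and the threshold values are illustrative. A full COCO-style AP computation additionally ranks predictions by confidence and averages precision over recall levels, which is not shown here.

```python
def concat_channels(y_t, x, k):
    """Concatenate three feature maps along the channel axis.

    Each argument is a list of channels; each channel is an H x W nested list.
    All channels must share the spatial size of the first channel of y_t.
    """
    height, width = len(y_t[0]), len(y_t[0][0])
    for name, tensor in (("y_t", y_t), ("x", x), ("k", k)):
        for channel in tensor:
            if len(channel) != height or any(len(row) != width for row in channel):
                raise ValueError(f"{name} does not match the {height}x{width} spatial size")
    return y_t + x + k


def mask_iou(pred, gt):
    """Intersection over union of two binary masks given as H x W nested lists."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks have IoU 0.
    return inter / union if union else 0.0


def hits_at_thresholds(pred, gt, thresholds=(0.5, 0.75, 0.9)):
    """Report whether a predicted mask counts as a match at each IoU threshold."""
    iou = mask_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


if __name__ == "__main__":
    noisy_mask = [[[0.1, -0.3], [0.7, 0.2]]]      # 1 channel (y_t)
    region = [[[0.0, 1.0], [1.0, 0.0]]] * 3       # 3 channels (x)
    support = [[[0.5, 0.5], [0.5, 0.5]]]          # 1 channel (k)
    print(len(concat_channels(noisy_mask, region, support)))

    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    print(mask_iou(pred, gt), hits_at_thresholds(pred, gt))
```

A mask that overlaps its target only loosely passes the low thresholds and fails the high ones, which is why gains concentrated at high IoU thresholds are read as evidence of more precise boundaries.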