MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation
Minh-Quan Le, Tam V. Nguyen, Trung-Nghia Le, Thanh-Toan Do, Minh N. Do, Minh-Triet Tran
TL;DR
MaskDiff tackles few-shot instance segmentation (FSIS) by modeling the conditional distribution of binary masks with a diffusion probabilistic model. It uses a UNet-based denoiser conditioned on image regions and $K$-shot information, augmented by classifier-free guided sampling to inject category signals. The paper provides full forward/reverse diffusion derivations, a variational upper-bound loss with a simple training objective, and thorough ablations, plus strong COCO-based results showing improved accuracy and stability over prior few-shot object detection (FSOD) and FSIS methods. The approach preserves spatial details by conditioning on object regions rather than pooling them, and performs competitively on both base and novel classes. Overall, MaskDiff demonstrates that diffusion-based conditional mask modeling yields robust, high-precision segmentation in data-scarce regimes.
Abstract
Few-shot instance segmentation extends the few-shot learning paradigm to the instance segmentation task, which aims to segment object instances in a query image given only a few annotated examples of novel categories. Conventional approaches have attempted to address the task via prototype learning, known as point estimation. However, this mechanism depends on prototypes (e.g., the mean of the $K$-shot examples) for prediction, leading to performance instability. To overcome the disadvantage of the point estimation mechanism, we propose a novel approach, dubbed MaskDiff, which models the underlying distribution of a binary mask conditioned on an object region and $K$-shot information. Inspired by augmentation approaches that perturb data with Gaussian noise to populate low data density regions, we model the mask distribution with a diffusion probabilistic model. We also propose to utilize classifier-free guided mask sampling to integrate category information into the binary mask generation process. Without bells and whistles, our proposed method consistently outperforms state-of-the-art methods on both base and novel classes of the COCO dataset while simultaneously being more stable than existing methods. The source code is available at: https://github.com/minhquanlecs/MaskDiff.
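To make the two ingredients named above concrete, the following is a minimal NumPy sketch, not the released implementation. It shows (i) the standard Gaussian forward-diffusion step applied to a binary mask and (ii) the commonly used classifier-free guidance rule for combining conditional and unconditional noise predictions. The function names, the linear noise schedule, the rescaling of the mask to $[-1, 1]$, and the guidance form $(1 + w)\,\epsilon_{\text{cond}} - w\,\epsilon_{\text{uncond}}$ are all assumptions made for illustration; the exact choices in MaskDiff may differ.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule (an assumed, common default)."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(mask, t, alphas_cumprod, rng):
    """Forward diffusion: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.

    The binary mask is rescaled from {0, 1} to {-1, 1} first (an assumption).
    Returns the noisy mask and the Gaussian noise that was added.
    """
    x0 = mask.astype(np.float64) * 2.0 - 1.0
    noise = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise
    return x_t, noise


def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance in its standard form (assumed here):
    eps = (1 + w) * eps_cond - w * eps_uncond.
    """
    return (1.0 + guidance_scale) * eps_cond - guidance_scale * eps_uncond


# Toy usage: noise an 8x8 mask containing a 4x4 square at step t = 500.
rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)

mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1

t = 500
x_t, noise = q_sample(mask, t, alphas_cumprod, rng)
```

In a trained model, `eps_cond` and `eps_uncond` would come from the denoiser evaluated with and without the category signal. With `guidance_scale = 0` the rule reduces to the purely conditional prediction, and larger values push samples further toward the conditioned category.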
