Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

Jae Joong Lee, Bedrich Benes, Raymond A. Yeh

TL;DR

This work tackles amodal segmentation by proposing a tuning-free, zero-shot approach that repurposes pretrained diffusion-based inpainting models. By exploiting the occlusion-free bias of inpainting, it fills occluded regions and applies segmentation without additional training, using a carefully designed conditioning pipeline: a context-aware condition image, a soft inpainting area, and leakage conditioning to preserve scene context. Quantitative results across five diverse datasets show consistent improvements over the prior SOTA, with SDXL-based implementations delivering notable gains and efficiency benefits. The approach enables generalizable amodal predictions across unseen categories and occlusion scenarios, highlighting the practical potential of diffusion-based priors for segmentation tasks without amodal data.

Abstract

Amodal segmentation aims to predict segmentation masks for both the visible and occluded regions of an object. Most existing works formulate this as a supervised learning problem, requiring manually annotated amodal masks or synthetic training data. Consequently, their performance depends on the quality of the datasets, which often lack diversity and scale. This work introduces a tuning-free approach that repurposes pretrained diffusion-based inpainting models for amodal segmentation. Our approach is motivated by the "occlusion-free bias" of inpainting models, i.e., the inpainted objects tend to be complete objects without occlusions. Specifically, we reconstruct the occluded regions of an object via inpainting and then apply segmentation, all without additional training or fine-tuning. Experiments on five datasets demonstrate the generalizability and robustness of our approach. On average, our approach achieves 5.3% more accurate masks over the state-of-the-art.
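To make the task definition concrete, the following is a minimal sketch of how an amodal mask relates to the visible and occluded regions, and of one way to score a predicted mask. The abstract does not name the accuracy metric; intersection-over-union (IoU) is an assumption here, and the helper name `mask_iou` and the toy masks are hypothetical.

```python
import numpy as np

def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a convention.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

# Toy ground truth: a 4x4 object whose right half is hidden by an occluder.
amodal_gt = np.zeros((6, 6), dtype=bool)
amodal_gt[1:5, 1:5] = True                 # full object extent, 16 pixels
visible = np.zeros((6, 6), dtype=bool)
visible[1:5, 1:3] = True                   # visible part, 8 pixels
occluded = amodal_gt & ~visible            # hidden part, 8 pixels

# A hypothetical prediction that recovers only one of the two hidden columns.
amodal_pred = np.zeros((6, 6), dtype=bool)
amodal_pred[1:5, 1:4] = True               # 12 pixels, all inside amodal_gt

iou = mask_iou(amodal_pred, amodal_gt)     # 12 / 16
```

In this toy case the visible mask alone would score 8/16 against the amodal ground truth, so any gain above that comes from correctly recovering occluded pixels.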

Paper Structure

This paper contains 16 sections, 10 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: Occlusion-free bias for a diffusion inpainting model. We observe that an inpainted object is always placed without occlusions inside the inpainting area (blue box), even though, e.g., a tree could have been inpainted behind the fence.
  • Figure 2: Our approach takes two inputs: an RGB image ${\bm{I}}$ and a visible mask ${\bm{V}}$. From ${\bm{I}}$, we generate a conditioned RGB image with a color distribution-aware background ${\bm{x}}_{\tt bck}$ and a partial Gaussian noise-added object ${\bm{x}}_{\tt obj}$. From ${\bm{V}}$, we create a customized inpainting area ${\bm{M}}$, so that any diffusion-based inpainting model can be used to create an inpainted image $\hat{{\bm{x}}}_0$, from which we extract the amodal mask $\hat{{\bm{A}}}$.
  • Figure 3: We visualize the diffusion process. Because we perform soft inpainting, our approach can predict an amodal mask much larger than the visible mask, i.e., extrapolate.
  • Figure 4: We compare the accuracy of amodal masks on COCO-A and BSDS-A using various diffusion-based inpainting models. Keep in mind that we focus on generating accurate amodal masks rather than on synthesizing an accurate and high-quality image. We highlight incomplete and out-of-shape areas using a red box.
  • Figure 5: Qualitative comparison of amodal masks on KINS, FishBowl, and SAILVOS. We observe that for novel categories or domains, pix2gestalt may hallucinate inaccurate amodal masks.
  • ...and 7 more figures
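
The inputs named in the Figure 2 caption can be illustrated with a rough NumPy sketch. The captions do not specify how ${\bm{x}}_{\tt bck}$, ${\bm{x}}_{\tt obj}$, or ${\bm{M}}$ are computed, so every formula below is an assumption made for illustration: the background is resampled from a per-channel Gaussian fitted to background colors, the object is linearly blended with Gaussian noise, and the soft inpainting area is a per-pixel strength map. The function name `build_conditioning` and the parameters `noise_level` and `keep_strength` are hypothetical.

```python
import numpy as np

def build_conditioning(image, visible, noise_level=0.5, keep_strength=0.3, seed=0):
    """Hypothetical construction of a condition image and a soft inpainting area.

    image:   float array (H, W, 3) in [0, 1], the RGB image I
    visible: bool array (H, W), the visible mask V
    Returns (cond_image, soft_area).
    """
    rng = np.random.default_rng(seed)
    v = visible.astype(bool)

    # x_bck (assumption): sample background colors from a per-channel
    # Gaussian fitted to the pixels outside the visible mask.
    bck_pixels = image[~v] if (~v).any() else image.reshape(-1, 3)
    mean, std = bck_pixels.mean(axis=0), bck_pixels.std(axis=0)
    x_bck = np.clip(rng.normal(mean, std, size=image.shape), 0.0, 1.0)

    # x_obj (assumption): partially noise the image by blending it with
    # Gaussian noise centered on mid-gray.
    noise = rng.normal(0.5, 0.25, size=image.shape)
    x_obj = np.clip((1.0 - noise_level) * image + noise_level * noise, 0.0, 1.0)

    # Condition image: noised object inside V, resampled background elsewhere.
    cond_image = np.where(v[..., None], x_obj, x_bck)

    # M (assumption): regenerate fully outside V, only weakly inside V.
    soft_area = np.where(v, keep_strength, 1.0)
    return cond_image, soft_area

# Toy input: a bright square object on a dark background, half of it visible.
image = np.full((8, 8, 3), 0.2)
image[2:6, 2:6] = 0.8
visible = np.zeros((8, 8), dtype=bool)
visible[2:6, 2:4] = True
cond_image, soft_area = build_conditioning(image, visible)
```

Under these assumptions, `cond_image` and `soft_area` would be passed to a diffusion-based inpainting model in place of the usual image and binary mask, and a segmentation step on the inpainted result would give the amodal mask $\hat{{\bm{A}}}$; neither of those steps is shown here.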