Amodal Ground Truth and Completion in the Wild

Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman

TL;DR

The paper tackles amodal segmentation in real-world images by generating authentic ground-truth amodal masks from 3D scene data, resulting in the MP3D-Amodal benchmark. It introduces two architectures that do not require occluder masks at inference: OccAmodal, a two-stage approach that first predicts the occluder and then the amodal mask, and SDAmodal, a one-stage method that leverages pre-trained Stable Diffusion features for amodal completion. Both approaches achieve state-of-the-art performance on COCOA and MP3D-Amodal, with SDAmodal demonstrating strong cross-domain generalization to unseen object categories. The work demonstrates that automatic 3D-grounded ground truth enables robust, model-agnostic amodal completion in the wild, with practical implications for downstream tasks like 3D reconstruction and manipulation planning.

Abstract

This paper studies amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work, the amodal segmentation ground truth on real images is usually produced by manual annotation and is thus subjective. In contrast, we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark, MP3D-Amodal, consisting of a variety of object categories and labels. To better handle the amodal completion task in the wild, we explore two architecture variants: a two-stage model that first infers the occluder, followed by amodal mask completion; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. Without bells and whistles, our method achieves new state-of-the-art performance on amodal segmentation datasets that cover a large variety of objects, including COCOA and our new MP3D-Amodal dataset. The dataset, model, and code are available at https://www.robots.ox.ac.uk/~vgg/research/amodal/.

Paper Structure (29 sections, 2 equations, 15 figures, 8 tables)


Figures (15)

  • Figure 1: Amodal Ground Truth and Completion in the Wild. Top: The task of amodal completion is to predict the amodal mask $A_i$ for an object '$i$' in the image specified by the modal mask $M_i$ (here the object is the rear motorbike). Previous methods [zhan2020self, nguyen2021weakly] require the mask of the occluder $F_i$ to be provided as well; our goal is to predict the amodal mask when the occluder mask is not provided and the occluded object is of any category. Bottom: We propose a novel method for generating amodal masks for real images: using 3D structure to produce ground truth modal and amodal masks for object instances. The method is used to generate a ground truth evaluation dataset for real images.
  • Figure 2: Samples from the MP3D-Amodal Dataset. For each sample, the original image is shown together with the generated modal and amodal masks.
  • Figure 3: Automated Generation of the MP3D-Amodal Ground Truth Dataset. The dataset is automatically generated from the Matterport3D dataset, and provides ground truth modal and amodal masks for objects in real images. The generation process is illustrated here for the chair and proceeds in two steps. First, modal and amodal masks in a particular image are obtained for each object: the amodal mask by projecting the object's 3D mesh individually, and the modal mask by projecting the 3D meshes of all objects. In this example, the 3D mesh of the bed occludes the chair when projected into the image. Second, an object is selected for the dataset if its amodal mask is larger than its modal mask by a threshold. In this case the chair is selected, but other objects such as the stool would not be selected: they are not occluded by other objects in this viewpoint, so their modal and amodal masks would be the same.
  • Figure 4: Distributions of the MP3D-Amodal Dataset in terms of the number of instances for each Matterport3D category, and the number of instances for different occlusion ratios.
  • Figure 5: Two-Stage Architecture (OccAmodal) for Amodal Prediction. Left: A lightweight U-Net based architecture is used to predict the occluder mask for each object. Right: The amodal predictor takes the predicted occluder mask, together with the modal mask and image as input to predict the amodal segmentation mask.
  • ...and 10 more figures
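
The selection step described in the Figure 3 caption (keep an object only if its amodal mask is larger than its modal mask by a threshold) can be illustrated with a minimal sketch. This is not the authors' code: the nested-list mask representation, the function names, the ratio-based form of the comparison, and the value `min_ratio=0.2` are all assumptions made for illustration, since the caption does not state the threshold or how it is applied.

```python
from typing import List

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = List[List[int]]


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels of a binary mask."""
    return sum(sum(row) for row in mask)


def occlusion_ratio(modal: Mask, amodal: Mask) -> float:
    """Fraction of the amodal (full) mask that is not visible in the modal mask."""
    amodal_area = mask_area(amodal)
    if amodal_area == 0:
        return 0.0  # object does not project into this view at all
    return (amodal_area - mask_area(modal)) / amodal_area


def is_selected(modal: Mask, amodal: Mask, min_ratio: float = 0.2) -> bool:
    """Keep an object only if enough of it is occluded.

    min_ratio is an assumed value; the source only says "by a threshold".
    """
    return occlusion_ratio(modal, amodal) >= min_ratio


# Toy stand-in for the chair: the right half of its full extent is hidden.
chair_amodal = [[1, 1, 1, 1],
                [1, 1, 1, 1]]
chair_modal = [[1, 1, 0, 0],
               [1, 1, 0, 0]]

# Toy stand-in for the stool: unoccluded, so modal and amodal masks coincide.
stool_amodal = [[0, 1, 1, 0],
                [0, 1, 1, 0]]
stool_modal = stool_amodal

print(is_selected(chair_modal, chair_amodal))  # half of the chair is hidden
print(is_selected(stool_modal, stool_amodal))  # nothing is hidden
```

In this toy example the chair, with half of its pixels hidden, passes the filter, while the unoccluded stool does not, mirroring the behaviour the caption describes.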