PLUG: Revisiting Amodal Segmentation with Foundation Model and Hierarchical Focus

Zhaochen Liu, Limeng Qiao, Xiangxiang Chu, Tingting Jiang

TL;DR

This work tackles amodal segmentation by reframing SAM as a foundation-model-based solution that can predict complete object shapes under occlusion. It introduces PLUG, a hierarchical approach with region-level parallel LoRA adapters and a point-level uncertainty-guided loss, enabling two specialized branches to predict inmodal and amodal regions before a refine module fuses them into a final prediction. The method demonstrates state-of-the-art performance on KINS and COCOA with significantly fewer trainable parameters, thanks to parameter-efficient fine-tuning and the two-branch design. The results highlight the practical impact of leveraging foundation-model priors for data-deficient tasks and point to promising directions for handling ambiguous boundaries and non-rigid objects in amodal segmentation, with a lightweight, scalable framework grounded in SAM.

Abstract

Amodal segmentation aims to predict the complete shapes of partially occluded objects and is an important step towards visual intelligence. Useful prior knowledge comes from sufficient training, yet the limited supply of amodal annotations makes strong performance hard to achieve. To tackle this problem, we draw on the strong priors accumulated in a foundation model and propose PLUG, the first SAM-based amodal segmentation approach. Methodologically, we present a novel framework with hierarchical focus to better adapt to the task characteristics and unleash the potential of SAM. At the region level, because visible and occluded areas are both associated and distinct, inmodal and amodal regions are assigned as the focuses of separate branches to avoid mutual disturbance. At the point level, we introduce the concept of uncertainty to explicitly help the model identify and focus on ambiguous points. Guided by the uncertainty map, a computationally economical point loss is applied to improve the accuracy of predicted boundaries. Experiments on several prominent datasets show that our method outperforms existing methods by large margins. Even with fewer total parameters, it retains a clear advantage.

Paper Structure (25 sections, 11 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Visualization of the performance comparison. $\text{mIoU}_{full}$ and $\text{mIoU}_{occ}$ denote the mean IoU over the complete mask and over the occluded region of each object, respectively.
  • Figure 2: The architecture of PLUG. (a, b) Built on SAM, two parallel sets of LoRA adapters (Inmodal LoRA, Amodal LoRA) and two corresponding mask decoders (Inmodal Decoder, Amodal Decoder) are introduced to process the different regions separately and avoid mutual disturbance. Guided by uncertainty maps (defined in Sec. \ref{sec:point}), a simple yet effective refine module then rectifies ambiguous points near the boundary. The refine module takes the original image, the coarse predictions, and the uncertainty maps as input. (c) In each transformer block of the image encoder, low-rank adaptation matrices are applied to the attention module. The computation of $Q, V$ passes through two parallel side paths that focus on inmodal and amodal regions, respectively (see Sec. \ref{sec:parallel}).
  • Figure 3: An illustration of the uncertainty guidance. (a) The uncertainty of each pixel is defined as the average cross entropy of its neighborhood. (b) We select $cK$ points with high uncertainty (top $cK$) and stochastic $(1-c)K$ points from randomly chosen $nK$ points to apply the point loss.
  • Figure 4: Qualitative results. The qualitative comparison of predicted amodal masks from VRSP, AISFormer, C2F-Seg and our proposed PLUG approach. The first two rows are from the KINS dataset, while the last three rows are from the COCOA dataset. Zoom in for a better view.
  • Figure 5: An example of the limitation. In this image, the man's occluded arm is not well segmented.
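
The uncertainty guidance described in the Figure 3 caption can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the cross entropy is assumed to be the binary cross entropy between a predicted foreground probability and a ground-truth label, the neighborhood is assumed to be a square window of hypothetical radius `radius`, and the caption's sampling rule is read as "draw $nK$ random candidates, keep the $cK$ most uncertain, then draw the remaining $(1-c)K$ at random from the other candidates". The function names and the default values of `c` and `n` are likewise hypothetical.

```python
import math
import random


def bce(p, y, eps=1e-7):
    """Binary cross entropy of one pixel (assumed form of the per-pixel CE)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def uncertainty_map(prob, target, radius=1):
    """Uncertainty of a pixel = mean cross entropy over its neighborhood."""
    h, w = len(prob), len(prob[0])
    ce = [[bce(prob[i][j], target[i][j]) for j in range(w)] for i in range(h)]
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Square window clipped at the image border (an assumption).
            vals = [ce[a][b]
                    for a in range(max(0, i - radius), min(h, i + radius + 1))
                    for b in range(max(0, j - radius), min(w, j + radius + 1))]
            out[i][j] = sum(vals) / len(vals)
    return out


def select_points(unc, K, c=0.75, n=3, rng=None):
    """Pick K points: top c*K by uncertainty plus (1-c)*K random ones,
    both taken from n*K randomly chosen candidate pixels."""
    rng = rng or random.Random(0)
    h, w = len(unc), len(unc[0])
    coords = [(i, j) for i in range(h) for j in range(w)]
    candidates = rng.sample(coords, min(n * K, len(coords)))
    # Most uncertain candidates first.
    candidates.sort(key=lambda ij: unc[ij[0]][ij[1]], reverse=True)
    n_top = int(c * K)
    top = candidates[:n_top]
    remaining = candidates[n_top:]
    rest = rng.sample(remaining, min(K - n_top, len(remaining)))
    return top + rest
```

The point loss would then be evaluated only at the returned coordinates, which is what keeps it cheap compared with a dense loss over every pixel.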