WeakSAM: Segment Anything Meets Weakly-supervised Instance-level Recognition

Lianghui Zhu, Junwei Zhou, Yan Liu, Xin Hao, Wenyu Liu, Xinggang Wang

TL;DR

WeakSAM integrates the Segment Anything Model with weakly supervised learning by automatically prompting SAM with classification clues, enabling class-aware proposals for WSOD and WSIS. It tackles pseudo ground-truth incompleteness and noise via adaptive PGT generation and RoI drop regularization, while extending SAM to automatic, label-aware segmentation. The approach demonstrates state-of-the-art results on WSOD and WSIS benchmarks with significant efficiency gains, and its PGT-refinement strategy provides a practical path toward scalable weakly supervised recognition. This framework offers a unified, SAM-powered solution that reduces labeling costs and improves instance-level perception in vision tasks.

Abstract

Weakly supervised visual recognition using inexact supervision is a critical yet challenging learning problem. It significantly reduces human labeling costs and traditionally relies on multi-instance learning and pseudo-labeling. This paper introduces WeakSAM, which tackles weakly-supervised object detection (WSOD) and weakly-supervised instance segmentation (WSIS) by utilizing the pre-learned world knowledge contained in a vision foundation model, i.e., the Segment Anything Model (SAM). WeakSAM addresses two critical limitations in traditional WSOD retraining, i.e., pseudo ground truth (PGT) incompleteness and noisy PGT instances, through adaptive PGT generation and Region of Interest (RoI) drop regularization. It also addresses two limitations of SAM for automatic object detection and segmentation: its need for prompts and its lack of category awareness. Our results indicate that WeakSAM surpasses previous state-of-the-art methods on WSOD and WSIS benchmarks by large margins, i.e., average improvements of 7.4% and 8.5%, respectively. The code is available at https://github.com/hustvl/WeakSAM.
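To make the PGT-incompleteness problem concrete, here is a minimal Python sketch of one way adaptive PGT generation could work. It is not the authors' implementation: the abstract does not give the selection rule, so the sketch assumes that, instead of keeping only the top-scoring box per class, every box scoring within a fraction of the class's best score is kept and near-duplicates are suppressed. The names `adaptive_pgt`, `rel_thresh`, and `nms_iou` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, Hashable, float]  # (box, class id, score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def adaptive_pgt(
    detections: Sequence[Detection],
    image_labels: Sequence[Hashable],
    rel_thresh: float = 0.8,  # assumed: keep boxes scoring >= rel_thresh * best
    nms_iou: float = 0.5,     # assumed: overlap above which a box is a duplicate
) -> Dict[Hashable, List[Box]]:
    """Select pseudo ground-truth boxes for the classes known to be in the image."""
    pgt: Dict[Hashable, List[Box]] = {}
    for cls in image_labels:
        # Only classes in the image-level label are eligible.
        cands = sorted(
            (d for d in detections if d[1] == cls),
            key=lambda d: d[2],
            reverse=True,
        )
        if not cands:
            continue
        best = cands[0][2]
        kept: List[Box] = []
        for box, _, score in cands:
            if score < rel_thresh * best:
                break  # remaining candidates score even lower
            # Greedy suppression of boxes overlapping an already kept box.
            if all(iou(box, k) < nms_iou for k in kept):
                kept.append(box)
        pgt[cls] = kept
    return pgt
```

With a top-1 rule, an image containing two dogs would yield a single pseudo box. Under the assumptions above, both instances survive as long as their scores are close to the best one, which is the incompleteness issue the abstract refers to.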

Paper Structure

This paper contains 38 sections, 4 equations, 9 figures, 11 tables, 2 algorithms.

Figures (9)

  • Figure 1: Quantitative comparisons between WeakSAM and previous SOTA methods under different tasks and benchmarks. The scale of each axis in the radar chart is normalized by the performance of the previous SOTA methods (marked in parentheses), and the stride of each axis is the same.
  • Figure 2: An overview of the proposed WeakSAM framework. We first generate activation maps from a classification ViT [zhu2023weaktr]. Subsequently, we introduce classification clues and spatial points as automatic WeakSAM prompts, which removes SAM's dependence on interactive prompts. Next, we use the WeakSAM proposals in the WSOD pipeline, in which the weakly-supervised detector performs class-aware perception to annotate pseudo ground truth (PGT). Then, we analyze the incompleteness and noise problems in PGT and propose adaptive PGT generation and RoI drop regularization to address them, respectively. Finally, we launch WSIS training supervised by pseudo instance labels, which are obtained by using the adaptive PGT as SAM prompts. The snowflake mark means the model is frozen.
  • Figure 3: The relationship between the normalized classification loss, the corresponding number of RoIs, and the error rate. The results are obtained from training Faster R-CNN using PGT in the preliminary training stage.
  • Figure 4: Visualization of the weakly-supervised object detection on the PASCAL VOC 2007 test set.
  • Figure 5: Visualization of the weakly-supervised instance segmentation on the PASCAL VOC 2012 val set.
  • ...and 4 more figures
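
Figure 3 relates the normalized classification loss of RoIs to their error rate, which motivates RoI drop regularization. The Python sketch below illustrates the general idea under stated assumptions and is not the paper's algorithm: it assumes per-RoI losses are min-max normalized within a batch and that RoIs above a hypothetical threshold `drop_thresh` are excluded from the averaged loss, on the premise that high-loss RoIs are more likely to be matched to noisy PGT. The function name `roi_drop` is also hypothetical.

```python
from typing import List, Sequence, Tuple


def roi_drop(
    roi_losses: Sequence[float],
    drop_thresh: float = 0.8,  # assumed cut-off on the normalized loss
) -> Tuple[List[int], float]:
    """Return the indices of kept RoIs and the mean loss over them.

    Losses are min-max normalized to [0, 1] within the batch. RoIs whose
    normalized loss exceeds drop_thresh are treated as likely noise and
    excluded from the mean.
    """
    if not roi_losses:
        return [], 0.0
    lo, hi = min(roi_losses), max(roi_losses)
    span = hi - lo
    if span == 0:
        # All losses are equal, so there is nothing to single out.
        kept = list(range(len(roi_losses)))
    else:
        kept = [
            i for i, loss in enumerate(roi_losses)
            if (loss - lo) / span <= drop_thresh
        ]
    mean_loss = sum(roi_losses[i] for i in kept) / len(kept)
    return kept, mean_loss


# Usage: the fourth RoI has an outlying loss and is dropped.
kept_idx, loss = roi_drop([0.1, 0.2, 0.3, 2.1])
```

For any non-negative `drop_thresh`, the lowest-loss RoI always has a normalized loss of 0 and is kept, so the mean is never taken over an empty set. A real detector would apply the same mask to per-RoI loss tensors before reduction.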