Boosting Salient Object Detection with Knowledge Distillated from Large Foundation Models

Miaoyang He, Shuyong Gao, Tsui Qin Mok, Weifeng Ge, Wengqiang Zhang

TL;DR

This work tackles the high cost of pixel-level annotation in salient object detection by proposing a foundation-model-guided, weakly supervised pipeline that distills knowledge from large multimodal models to generate high-quality pseudo-labels. It introduces a four-stage pseudo-mask generation workflow (manual text annotation, BLIP fine-tuning, GroundingDINO localization, and SAM segmentation) and presents BDS-TR, a large-scale, diverse dataset with ~260k images spanning ~960 categories. An edge-preserving dynamic decoder (DEDecoder) then leverages precise pseudo-label edges to recover fine details during decoding, guided by a composite loss combining BCE, partial BCE, and IoU terms. Evaluated on five benchmarks, the approach achieves state-of-the-art results among weakly supervised SOD methods and rivals several fully supervised models, demonstrating strong generalization and practical impact; the code and results are to be released.
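To make the composite loss concrete, here is a minimal sketch in plain Python over flattened pixel lists. The summary only names the three terms, so several details are assumptions made for illustration: the terms are summed with equal weights, "partial BCE" is taken to mean BCE averaged only over pixels flagged by a hypothetical `valid` mask, and the IoU term is the common soft form `1 - intersection / union`. The paper's actual formulation may differ, and real training code would use a tensor library rather than Python lists.

```python
import math

EPS = 1e-7  # numerical guard against log(0) and division by zero


def _clip(p):
    """Keep a probability strictly inside (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def bce(pred, target):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = _clip(p)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def partial_bce(pred, target, valid):
    """BCE averaged only over pixels where valid is truthy (assumed definition)."""
    total, count = 0.0, 0
    for p, t, v in zip(pred, target, valid):
        if v:
            p = _clip(p)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0


def iou_loss(pred, target):
    """Soft IoU loss: 1 - intersection / union, computed on probabilities."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + EPS) / (union + EPS)


def composite_loss(pred, target, valid, w_bce=1.0, w_pbce=1.0, w_iou=1.0):
    """Weighted sum of the three terms; equal weights are an assumption."""
    return (
        w_bce * bce(pred, target)
        + w_pbce * partial_bce(pred, target, valid)
        + w_iou * iou_loss(pred, target)
    )
```

A prediction that matches the pseudo-label exactly drives all three terms close to zero, while an inverted prediction is penalized by every term. The `valid` mask lets unreliable pixels be excluded from the partial term without removing them from the IoU term.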

Abstract

Salient Object Detection (SOD) aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated labels with pixel-level precision, which are time-consuming to produce. To address this, we develop a low-cost, high-precision annotation method that leverages large foundation models. Specifically, we use a weakly supervised approach in which textual prompts guide large models to generate pseudo-labels. Since large models do not effectively focus on the salient regions of images, we manually annotate text for a subset of images and use it to fine-tune the model. Building on this approach, which enables precise and rapid pseudo-label generation, we introduce a new dataset, BDS-TR. Compared with the previous DUTS-TR dataset, BDS-TR is larger in scale and covers a wider variety of categories and scenes. This expansion broadens our model's applicability across scenarios and provides a more comprehensive foundational dataset for future SOD research. Additionally, we present an edge decoder based on dynamic upsampling, which focuses on object edges while gradually recovering image feature resolution. Comprehensive experiments on five benchmark datasets demonstrate that our method significantly outperforms state-of-the-art weakly supervised approaches and also surpasses several existing fully supervised SOD methods. The code and results will be made available.

Paper Structure (22 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Visual comparison of pseudo-label generation between our method and other approaches. Each row, from top to bottom, shows Image, GT, scribble labels [zhang2020weakly], point labels [gao2022weakly], CAMs [zhou2016learning], and ours. Compared to other weakly supervised methods, our approach generates high-precision pseudo-labels.
  • Figure 2: Category distribution in BDS-TR and comparison with DUTS-TR. In sub-figure (a), each point represents a category within the dataset. BDS-TR significantly surpasses DUTS-TR in both category diversity and quantity. In sub-figure (b), DUTS-TR exhibits a highly uneven distribution within the subcategory of Transportation.
  • Figure 3: Annotation pipeline in four steps. Step 1: Manually pre-annotating a small portion of images. Step 2: Fine-tuning BLIP to generate textual descriptions for all images. Step 3: Using GroundingDINO to produce object detection boxes. Step 4: Employing SAM to segment the masks. The red background indicates GroundingDINO, while the blue background denotes SAM.
  • Figure 4: Impact of Adjectives in Pseudo Masks: (a) Pseudo Labels Generated Without Adjectives (b) Pseudo Labels Generated With Adjectives.
  • Figure 5: Histogram of Some Common Parent and Subcategories, with Objects Sorted by Frequency. The entire dataset consists of various objects typically found in everyday scenarios.
  • ...and 4 more figures
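
The data flow of the annotation pipeline in Figure 3 can be sketched as follows. This is only an illustrative skeleton, not the authors' code: the three callables stand in for a fine-tuned BLIP captioner, GroundingDINO, and SAM, and their signatures, the `PseudoLabel` container, the box format, and the rule of skipping images with no detected box are all hypothetical. Step 1 (manual annotation and fine-tuning) is assumed to have happened before the captioner is passed in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical box format
Mask = List[List[int]]           # binary mask as nested lists, for illustration


@dataclass
class PseudoLabel:
    image_id: str
    caption: str
    boxes: List[Box]
    mask: Mask


def generate_pseudo_labels(
    image_ids: Sequence[str],
    captioner: Callable[[str], str],             # stand-in for fine-tuned BLIP
    detector: Callable[[str, str], List[Box]],   # stand-in for GroundingDINO
    segmenter: Callable[[str, List[Box]], Mask], # stand-in for SAM
) -> List[PseudoLabel]:
    """Chain caption -> boxes -> mask for every image."""
    labels = []
    for image_id in image_ids:
        caption = captioner(image_id)        # Step 2: text description
        boxes = detector(image_id, caption)  # Step 3: text-prompted boxes
        if not boxes:
            continue  # assumption: skip images where nothing is localized
        mask = segmenter(image_id, boxes)    # Step 4: box-prompted mask
        labels.append(PseudoLabel(image_id, caption, boxes, mask))
    return labels


# Toy stand-ins so the skeleton runs without any model weights.
def toy_captioner(image_id: str) -> str:
    return "" if image_id == "empty" else "a red car"


def toy_detector(image_id: str, caption: str) -> List[Box]:
    return [(1, 1, 3, 3)] if caption else []


def toy_segmenter(image_id: str, boxes: List[Box]) -> Mask:
    mask = [[0] * 4 for _ in range(4)]
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


labels = generate_pseudo_labels(
    ["img0", "empty"], toy_captioner, toy_detector, toy_segmenter
)
```

Passing the models in as callables keeps the skeleton independent of any particular library API, so each stage can be swapped or mocked when checking the flow end to end.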