HASSOD: Hierarchical Adaptive Self-Supervised Object Detection

Shengcao Cao; Dhiraj Joshi; Liang-Yan Gui; Yu-Xiong Wang

TL;DR

HASSOD learns to detect objects and understand their composition without human supervision. A hierarchical adaptive clustering step groups image regions into object masks and determines the number of objects per image, and an analysis of coverage relations between masks assigns each object to a whole, part, or subpart level. The detector is then refined through Mean Teacher self-training with adaptive targets, which replaces multi-round self-training with a smoother, more efficient process. The approach yields state-of-the-art self-supervised results on zero-shot benchmarks, raising Mask AR from 20.2 to 22.5 on LVIS and from 17.0 to 26.0 on SA-1B while using only a fraction of the data and training iterations. Beyond detection performance, the learned hierarchy improves interpretability and gives finer-grained control over segmentation granularity and object composition in a fully unsupervised regime.

Abstract

The human visual perception system demonstrates exceptional capabilities in learning without explicit supervision and understanding the part-to-whole composition of objects. Drawing inspiration from these two abilities, we propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a novel approach that learns to detect objects and understand their compositions without human supervision. HASSOD employs a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations, adaptively determining the number of objects per image. Furthermore, HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures. This additional self-supervised learning task leads to improved detection performance and enhanced interpretability. Lastly, we abandon the inefficient multi-round self-training process utilized in prior methods and instead adapt the Mean Teacher framework from semi-supervised learning, which leads to a smoother and more efficient training process. Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Project page: https://HASSOD-NeurIPS23.github.io.
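The clustering and level-assignment steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the greedy 4-neighbour merging on mean features, the cosine similarity measure, the 0.9 coverage ratio, the approximation of tree depth by the number of covering masks, and the names `merge_regions`, `ensemble_masks`, `hierarchical_levels`, and `toy_features` are all assumptions made for illustration. The sketch also skips the post-processing of masks.

```python
import numpy as np


def merge_regions(features, threshold):
    """Greedily merge 4-adjacent regions of an H x W x D feature grid.

    The most similar adjacent pair (cosine similarity of mean features) is
    merged first; merging stops once the best similarity is below `threshold`.
    Written for clarity, not speed: means are recomputed on every step.
    """
    h, w, _ = features.shape
    labels = np.arange(h * w).reshape(h, w)  # start with one region per patch

    def similarity(pair):
        u = features[labels == pair[0]].mean(axis=0)
        v = features[labels == pair[1]].mean(axis=0)
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    while True:
        # Find region pairs that touch horizontally or vertically.
        pairs = set()
        for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1], labels[1:])):
            touching = a != b
            pairs.update(zip(a[touching].tolist(), b[touching].tolist()))
        if not pairs:
            break  # everything has been merged into a single region
        best = max(sorted(pairs), key=similarity)
        if similarity(best) < threshold:
            break  # the number of remaining regions is decided by the data
        labels[labels == best[1]] = best[0]
    return labels


def ensemble_masks(features, thresholds):
    """Collect the distinct region masks produced by several merge thresholds."""
    unique = {}
    for t in thresholds:
        labels = merge_regions(features, t)
        for region_id in np.unique(labels):
            mask = labels == region_id
            unique[mask.tobytes()] = mask  # identical masks collapse to one entry
    return list(unique.values())


def hierarchical_levels(masks, coverage=0.9):
    """Label each mask 'whole', 'part', or 'subpart' from coverage relations.

    Assumption: a larger mask covers a smaller one if it contains at least
    `coverage` of its pixels; the level is the number of covering masks, capped at 2.
    """
    names = ("whole", "part", "subpart")
    levels = []
    for i, m in enumerate(masks):
        covering = sum(
            1
            for j, other in enumerate(masks)
            if j != i
            and other.sum() > m.sum()
            and (m & other).sum() >= coverage * m.sum()
        )
        levels.append(names[min(covering, 2)])
    return levels


def toy_features():
    """Hypothetical 2 x 6 grid: three groups of patches with distinct features."""
    groups = np.array([[1.0, 0.0, 0.0], [0.9, 0.3, 0.0], [0.5, 0.0, 0.87]])
    columns = np.repeat(groups, 2, axis=0)  # 6 columns, 2 per group
    return np.tile(columns, (2, 1, 1))      # shape (2, 6, 3)
```

On the toy grid, calling `ensemble_masks(toy_features(), (0.99, 0.9, 0.3))` and passing the result to `hierarchical_levels` gives nested masks, so the coarsest mask is labeled a whole and the finer ones parts or subparts. The thresholds here are arbitrary values chosen for the toy data.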

Paper Structure (24 sections, 11 figures, 9 tables)

Introduction
Related Work
Approach
Hierarchical Adaptive Clustering
Hierarchical Level Prediction
Mean Teacher Training with Adaptive Targets
Experiments
Data-Efficient and Computation-Efficient Training
Evaluation Datasets and Metrics
Self-Supervised Detection and Segmentation
Qualitative Results
Ablation Study
Conclusion
Deficiency of MS-COCO AP Evaluation in Self-Supervised Object Detection
Comparison of CutLER and HASSOD with Equal Training Data
...and 9 more sections

Figures (11)

Figure 1: Fully self-supervised object detection and instance segmentation on prevalent image datasets. Our approach, HASSOD, demonstrates a significant improvement over the previous state-of-the-art method, CutLER (Wang et al., 2023), by discovering a more comprehensive range of objects. Moreover, HASSOD understands the part-to-whole object composition like humans do, while previous methods cannot.
Figure 2: Two-stage discover-and-learn process in HASSOD. Stage 1 uses a frozen, self-supervised DINO (Caron et al., 2021) ViT backbone to discover initial pseudo-labels from unlabeled images. Stage 2 learns an object detector to improve over the pre-trained features and initial pseudo-labels.
Figure 3: Hierarchical adaptive clustering and hierarchical levels of objects. The procedure of creating initial pseudo-labels for training the object detector without any human annotations includes the following steps: (Initialize) Visual features are extracted from the given image by a ViT pre-trained with DINO (Caron et al., 2021), and each $8\times 8$ patch is initialized as one individual region. (Merge) Adjacent regions with the highest feature similarities are progressively merged into object masks, until the pre-set thresholds $\theta_i^\text{merge}$ are reached. (Post-Process) Object masks are selected and refined using simple post-processing techniques. (Ensemble) Results from multiple thresholds $\{\theta_i^\text{merge}\}_{i=1}^3$ are combined to ensure better coverage of potential objects. (Split) Analysis of coverage relations divides objects into three hierarchical levels: whole, part, and subpart. The example on the right illustrates the tree structure of object composition: The whole aircraft is composed of an upper and a lower part. The upper part further consists of a left wing, a right wing, and a person standing on it.
Figure 4: Mean Teacher self-training with adaptive targets in HASSOD. Two detectors of the same architecture, the teacher and the student, learn from each other to improve over the initial pseudo-labels. The teacher is updated as the exponential moving average (EMA) of the student. The student receives supervision from two branches: The teacher-to-student branch (top) encourages the student to mimic the teacher's predictions; the label-to-student branch (bottom) minimizes the discrepancy between the student's predictions and the initial pseudo-labels. During training, our proposed adaptive target strategy increases the weight for the teacher-to-student branch, $\alpha_\text{teacher}$, and decreases the weight for the label-to-student branch, $\alpha_\text{label}$, since the teacher becomes an increasingly reliable self-supervision source compared with the initial pseudo-labels.
Figure 5: Qualitative results on LVIS images. Overall, our HASSOD successfully detects more objects compared with CutLER (Wang et al., 2023). CutLER tends to detect only one or a few prominent objects in the image, while HASSOD captures other objects as well (e.g., bread in row 1, and traffic sign in row 3). Moreover, HASSOD learns the composition of objects (e.g., cat-face-eye in row 2, and vehicle-wheel-tire in row 4), which is similar to human perception.
...and 6 more figures
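
The Mean Teacher scheme in the Figure 4 caption combines two ingredients: an EMA update of the teacher and a weighted sum of the two student losses whose weights shift over training. The sketch below shows one way this could look. The linear schedule, the momentum value of 0.999, and the function names `ema_update`, `adaptive_weights`, and `student_loss` are assumptions for illustration; the caption only states that $\alpha_\text{teacher}$ increases and $\alpha_\text{label}$ decreases.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.999):
    """Return new teacher parameters as an EMA of the student parameters.

    `teacher` and `student` are dicts mapping parameter names to arrays.
    The momentum value is an assumed default, not taken from the paper.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def adaptive_weights(step, total_steps):
    """Hypothetical linear schedule for (alpha_teacher, alpha_label).

    alpha_teacher rises from 0 to 1 and alpha_label falls from 1 to 0,
    so the student relies more on the teacher as training progresses.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return progress, 1.0 - progress


def student_loss(loss_teacher, loss_label, step, total_steps):
    """Combine the teacher-to-student and label-to-student losses."""
    alpha_teacher, alpha_label = adaptive_weights(step, total_steps)
    return alpha_teacher * loss_teacher + alpha_label * loss_label


if __name__ == "__main__":
    teacher = {"w": np.zeros(3)}
    student = {"w": np.ones(3)}
    for step in range(5):
        # In real training the student would take a gradient step here.
        teacher = ema_update(teacher, student, momentum=0.9)
        print(step, teacher["w"][0], student_loss(2.0, 4.0, step, total_steps=4))
```

In a real detector the two loss terms would come from comparing the student's predictions with the teacher's predictions and with the initial pseudo-labels, respectively; here they are plain numbers so that the weighting is easy to follow.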
