CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers

Shahaf Arica; Or Rubin; Sapir Gershov; Shlomi Laufer

TL;DR

This paper introduces VoteCut, a method for unsupervised object discovery that leverages feature representations from multiple self-supervised models, and CuVLER (Cut-Vote-and-LEaRn), a zero-shot model trained with pseudo-labels generated by VoteCut and a novel soft target loss that refines segmentation accuracy.

Abstract

In this paper, we introduce VoteCut, an innovative method for unsupervised object discovery that leverages feature representations from multiple self-supervised models. VoteCut combines normalized-cut-based graph partitioning, clustering, and a pixel voting approach. Additionally, we present CuVLER (Cut-Vote-and-LEaRn), a zero-shot model trained using pseudo-labels generated by VoteCut and a novel soft target loss to refine segmentation accuracy. Through rigorous evaluations across multiple datasets and several unsupervised setups, our methods demonstrate significant improvements over previous state-of-the-art models. Our ablation studies further highlight the contribution of each component, revealing the robustness and efficacy of our approach. Collectively, VoteCut and CuVLER pave the way for future advancements in image segmentation.
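The normalized-cut step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the thresholded cosine-similarity affinity, the values `tau=0.2` and `eps=1e-5`, and the function names `second_smallest_eigenvector` and `kmeans_1d` are all assumptions made for illustration. The sketch computes the second-smallest eigenvector of the normalized graph Laplacian for a single model's patch features and then splits it with a simple 1D K-means, the step the Figure 1 caption describes for producing segment proposals.

```python
import numpy as np

def second_smallest_eigenvector(features, tau=0.2, eps=1e-5):
    """Return the NCut eigenvector for an (N, D) array of patch features.

    The affinity (thresholded cosine similarity) and the defaults for
    tau and eps are assumptions, not values taken from the source.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                          # cosine similarity between patches
    w = np.where(sim > tau, 1.0, eps)      # assumed binary-like affinity
    d = w.sum(axis=1)                      # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap_sym = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap_sym)      # eigenvalues in ascending order
    # Map back to the generalized problem (D - W) y = lambda * D y
    return d_inv_sqrt * vecs[:, 1]

def kmeans_1d(values, k, iters=50):
    """Plain Lloyd iterations on scalar values; returns one label per value."""
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

# Toy usage: six "patches" forming two well-separated feature groups.
feats = np.array([[1.0, 0.0], [1.0, 0.05], [0.9, 0.0],
                  [0.0, 1.0], [0.05, 1.0], [0.0, 0.9]])
eigvec = second_smallest_eigenvector(feats)
labels = kmeans_1d(eigvec, k=2)
```

In this toy case the two feature groups receive different labels. Running `kmeans_1d` with several values of `k` would yield several proposals per model, which is the role the varying K values play in the workflow described for Figure 1.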

Paper Structure (19 sections, 8 equations, 8 figures, 9 tables)

Introduction
Related work
Method
Normalized Cuts
VoteCut for object discoveries
Soft loss function
CuVLER
Implementation details
Experiments
In-domain evaluation
Zero-shot evaluation
Self-training evaluation
Ablations
Conclusion and Limitation
Additional details
...and 4 more sections

Figures (8)

Figure 1: (a) An illustrated overview of the VoteCut workflow. A set of models first runs inference on the input image, producing feature representations for individual patches. Normalized Cuts (NCut) are then performed following the methodology in (wang2022tokencut), yielding the second-smallest eigenvector from each model. Multiple segment proposals are generated by applying 1D K-means clustering to these eigenvectors with varying K values. The final stage of VoteCut clusters these proposals and extracts a definitive mask from each cluster via voting. Each definitive mask is also associated with a score. (b) The "Clustering & Voting" stage of VoteCut in detail. First, segments are clustered using an Intersection over Union (IoU) threshold, which determines segment membership within clusters. A voting mechanism is then applied within each cluster to decide whether each patch should be included in the segment. Lastly, a Conditional Random Field (CRF) (krahenbuhl2011efficient) refines the mask at a finer level. The cluster size determines the score assigned to each mask, as given in \ref{eq:score}.
Figure 2: Visual illustration of VoteCut performance vs. SOTA NCut-based object-discovery methods on the ImageNet validation set. The VoteCut bounding box score is calculated according to \ref{eq:score}.
Figure 3: In-domain evaluation of the VoteCut method, without CAD training, with varying $\tau^m$ on the ImageNet validation set.
Figure 4: Results of the VoteCut method without CAD training in an in-domain configuration with different $k_{max}$ values on the ImageNet validation set.
Figure 5: Model count ablation test. The results are obtained in an in-domain setup on the ImageNet validation set using the VoteCut method without CAD training.
...and 3 more figures
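The "Clustering & Voting" stage described in the Figure 1 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's procedure: the greedy pivot-based clustering, the thresholds `iou_thresh=0.5` and `vote_thresh=0.5`, the score defined as cluster size divided by the number of proposals, and the function names are all hypothetical. The caption only states that segments are clustered by an IoU threshold, that patches are voted on within each cluster, and that the cluster size determines the score. The CRF refinement is omitted.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def cluster_and_vote(proposals, iou_thresh=0.5, vote_thresh=0.5):
    """Group proposals by IoU, then keep patches most cluster members agree on.

    Greedy clustering against the first member of each cluster is an
    assumption; the source does not specify the clustering algorithm.
    """
    clusters = []  # each entry is a list of masks; members[0] is the pivot
    for mask in proposals:
        for members in clusters:
            if iou(mask, members[0]) >= iou_thresh:
                members.append(mask)
                break
        else:
            clusters.append([mask])

    results = []
    for members in clusters:
        # Fraction of cluster members that include each patch
        votes = np.mean(np.stack(members).astype(float), axis=0)
        final_mask = votes > vote_thresh
        # Assumed score: share of all proposals that fell into this cluster
        score = len(members) / len(proposals)
        results.append((final_mask, score))
    return results

# Toy usage on a 4x4 patch grid.
a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True   # top-left 2x2
b = np.zeros((4, 4), dtype=bool); b[:2, :3] = True   # slightly wider variant
c = np.zeros((4, 4), dtype=bool); c[2:, 2:] = True   # bottom-right 2x2
masks_and_scores = cluster_and_vote([a, b, c, a.copy()])
```

Here the three overlapping top-left proposals form one cluster whose voted mask equals `a`, while the bottom-right proposal forms a second, lower-scoring cluster.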
