Let Go of Your Labels with Unsupervised Transfer

Artyom Gadetsky; Yulun Jiang; Maria Brbic

TL;DR

The paper introduces TURTLE, a fully unsupervised method that discovers dataset labelings by maximizing margins across representation spaces of multiple foundation models, thereby enabling transfer without any supervision. Building on a generalization-based objective that biases toward max-margin solutions, TURTLE employs a bilevel optimization framework over multiple spaces with entropy-based regularization and alternating training. Empirically, TURTLE achieves state-of-the-art unsupervised performance on 26 vision datasets, and, with two representation spaces, often surpasses CLIP zero-shot while remaining competitive with supervised linear probes. The work demonstrates the strength of unsupervised transfer and suggests that richer, multi-model representations can significantly improve downstream labeling tasks without labeled data.
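The alternating scheme summarized above can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the function name `turtle_like_sketch`, the averaging of per-space softmax outputs to form the labeling, the warm-started inner classifiers, and all hyperparameter values (`lr`, `gamma`, step counts) are choices made here for illustration. The inner loop fits one linear classifier per representation space to the current soft labeling; the outer step updates the labeling parameters with a first-order gradient that treats those classifiers as fixed, with an entropy term on the average label distribution to discourage collapsing to a single class.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def turtle_like_sketch(spaces, n_classes, outer_steps=200, inner_steps=10,
                       lr=0.1, gamma=10.0, seed=0):
    """Toy alternating optimization over several fixed representation spaces.

    spaces: list of (N, d_k) arrays, one per (hypothetical) foundation model.
    Returns hard labels of shape (N,) and the soft labeling of shape (N, C).
    """
    rng = np.random.default_rng(seed)
    n, k, eps = spaces[0].shape[0], len(spaces), 1e-12
    # Labeling parameters (one linear map per space) and inner linear classifiers.
    thetas = [0.01 * rng.standard_normal((z.shape[1], n_classes)) for z in spaces]
    ws = [np.zeros((z.shape[1], n_classes)) for z in spaces]

    for _ in range(outer_steps):
        # Assumption: the soft labeling is the average of per-space softmax outputs.
        per_space = [softmax(z @ t) for z, t in zip(spaces, thetas)]
        tau = sum(per_space) / k

        # Inner problem: a few gradient steps of cross-entropy toward tau
        # (classifiers are warm-started from the previous outer step).
        for i, z in enumerate(spaces):
            for _ in range(inner_steps):
                q = softmax(z @ ws[i])
                ws[i] -= lr * z.T @ (q - tau) / n

        # Outer problem: cross-entropy of the fixed classifiers under tau,
        # minus gamma times the entropy of the average label distribution.
        log_q_sum = sum(np.log(softmax(z @ w) + eps) for z, w in zip(spaces, ws))
        tau_bar = tau.mean(axis=0)
        grad_tau = (-log_q_sum + gamma * (np.log(tau_bar + eps) + 1.0)) / n

        for i, z in enumerate(spaces):
            s, g = per_space[i], grad_tau / k
            # Backpropagate through the row-wise softmax of space i.
            grad_logits = s * (g - (g * s).sum(axis=1, keepdims=True))
            thetas[i] -= lr * z.T @ grad_logits

    tau = sum(softmax(z @ t) for z, t in zip(spaces, thetas)) / k
    return tau.argmax(axis=1), tau
```

Calling `turtle_like_sketch([feats_a, feats_b], n_classes=10)` with two precomputed feature matrices for the same images would return one cluster index per image. The indices are arbitrary, so comparing them to ground-truth classes requires matching clusters to classes first.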

Abstract

Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision and task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, although being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot on 26 datasets by employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling using the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.

Paper Structure (21 sections, 3 theorems, 26 equations, 16 figures, 8 tables, 1 algorithm)

Introduction
Background
Analysis of Generalization-Based Objective
TURTLE Framework
Experiments
Experimental setup
Results
Related Work
Conclusion
Proof of Proposition 3.1
Experimental Details
Datasets
Representations
Implementation Details
Details on Unsupervised Baselines and Numerical Results
...and 6 more sections

Key Result

Proposition 3.1

Given $M \gg 1$, $\theta \neq 0$ and appropriate step size $\eta$ which ensures convergence, then where $g(\theta) = (M \eta \exp(\|r_M(\theta)\|_2))^{-1}$, the residual $r_M(\theta)$ is bounded with $\lim_{M \to \infty} \|r_M(\theta)\|_2 = 0$, and $w_{\text{SVM}}(\theta)$ is the solution of the hard-margin SVM for a given $\theta$:
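For reference, a hard-margin SVM in its bias-free form is conventionally written as follows. The symbols $\phi(x_n)$ for the fixed representation of sample $x_n$, $y_n(\theta) \in \{-1, +1\}$ for the binary label that the labeling parameterized by $\theta$ assigns to that sample, and $N$ for the number of samples are notation assumed here and may differ from the paper's.

$$
w_{\text{SVM}}(\theta) = \arg\min_{w} \; \|w\|_2^2
\quad \text{s.t.} \quad y_n(\theta)\, w^\top \phi(x_n) \ge 1, \qquad n = 1, \dots, N.
$$

Under this formulation, a smaller $\|w_{\text{SVM}}(\theta)\|_2$ corresponds to a larger geometric margin $1 / \|w_{\text{SVM}}(\theta)\|_2$, which is why an objective tied to this norm favors labelings that are separable with a large margin.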

Figures (16)

Figure 1: Types of downstream transfer differ in the amount of available supervision. Given representation spaces of foundation models, (i) supervised transfer, represented as a linear probe, trains a linear classifier given labeled examples of a downstream dataset; (ii) zero-shot transfer assumes descriptions of the visual categories that appear in a downstream dataset are given, and employs them via text encoder to solve the task; and (iii) unsupervised transfer assumes the least amount of available supervision, i.e., only the number of categories is given, and aims to uncover the underlying human labeling of a dataset.
Figure 2: TURTLE outperforms unsupervised baselines. Comparison of TURTLE to unsupervised baselines with respect to accuracy. All methods use the CLIP ViT-L/14 and DINOv2 representations. Bars represent the average performance with standard deviations computed over three runs.
Figure 3: TURTLE is an efficient method. Comparison of running time between TURTLE and unsupervised baselines. TURTLE employs an efficient first-order optimization procedure, achieving more than a 10$\times$ speedup compared to HUME. All methods use the CLIP ViT-L/14 and DINOv2 representations. Bars represent the average running time over three runs. Standard deviations are negligible (Table \ref{tab:unsupervised_time}) and omitted for clarity.
Figure 4: TURTLE enables unsupervised transfer given representation spaces of foundation models. Employing the same CLIP representation space, TURTLE closely matches the performance of the corresponding CLIP zero-shot classifier on average over 26 datasets. With the use of an additional representation space, TURTLE outperforms zero-shot transfer, demonstrating exceptional abilities of unsupervised transfer learning.
Figure 5: TURTLE outperforms the CLIP zero-shot classifier on 15 out of 26 datasets. TURTLE is trained with CLIP ViT-L/14 and DINOv2 representations. CLIP zero-shot utilizes the same CLIP ViT-L/14 architecture. Furthermore, we observe that even with only a single CLIP representation space TURTLE outperforms CLIP on $13/26$ datasets (Figure \ref{fig:turtle_clip_zs}).
...and 11 more figures

Theorems & Definitions (7)

Proposition 3.1
Remark 3.2
Remark 3.3
Remark 3.4
Proposition 1.3
Proposition 1.4
proof
