Let Go of Your Labels with Unsupervised Transfer

Artyom Gadetsky, Yulun Jiang, Maria Brbic

TL;DR

The paper introduces TURTLE, a fully unsupervised method that discovers dataset labelings by maximizing margins across representation spaces of multiple foundation models, thereby enabling transfer without any supervision. Building on a generalization-based objective that biases toward max-margin solutions, TURTLE employs a bilevel optimization framework over multiple spaces with entropy-based regularization and alternating training. Empirically, TURTLE achieves state-of-the-art unsupervised performance on 26 vision datasets, and, with two representation spaces, often surpasses CLIP zero-shot while remaining competitive with supervised linear probes. The work demonstrates the strength of unsupervised transfer and suggests that richer, multi-model representations can significantly improve downstream labeling tasks without labeled data.
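The entropy-based regularization mentioned above can be illustrated with a small sketch. It assumes the regularizer is the entropy of the average predicted label distribution, a common choice in clustering-style objectives for discouraging labelings that collapse onto one class; the exact form used by TURTLE is not given in this summary. The function names (`mean_assignment`, `label_entropy_regularizer`) and the toy data are hypothetical.

```python
import math

def mean_assignment(soft_labels):
    """Average per-sample class probabilities into a single distribution."""
    n = len(soft_labels)
    num_classes = len(soft_labels[0])
    return [sum(row[c] for row in soft_labels) / n for c in range(num_classes)]

def entropy(dist, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -sum(p * math.log(p + eps) for p in dist)

def label_entropy_regularizer(soft_labels):
    """Assumed regularizer: entropy of the mean label distribution."""
    return entropy(mean_assignment(soft_labels))

# Toy soft labelings over two classes (made up for illustration).
balanced = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0]] * 4

print(label_entropy_regularizer(balanced))   # close to log(2)
print(label_entropy_regularizer(collapsed))  # close to 0
```

Under this assumption, maximizing the term favors labelings that use all classes, which is why a collapsed labeling scores near zero while a balanced one scores near the maximum of log(2) for two classes.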

Abstract

Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision and task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, although being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot on 26 datasets by employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling using the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.

Paper Structure

This paper contains 21 sections, 3 theorems, 26 equations, 16 figures, 8 tables, and 1 algorithm.

Key Result

Proposition 3.1

Given $M \gg 1$, $\theta \neq 0$, and an appropriate step size $\eta$ that ensures convergence, then where $g(\theta) = (M \eta \exp(\|r_M(\theta)\|_2))^{-1}$, the residual $r_M(\theta)$ is bounded with $\lim_{M \to \infty} \|r_M(\theta)\|_2 = 0$, and $w_{\text{SVM}}(\theta)$ is the solution of the hard-margin SVM for a given $\theta$:

Figures (16)

  • Figure 1: Types of downstream transfer differ in the amount of available supervision. Given representation spaces of foundation models, (i) supervised transfer, represented as a linear probe, trains a linear classifier given labeled examples of a downstream dataset; (ii) zero-shot transfer assumes descriptions of the visual categories that appear in a downstream dataset are given, and employs them via text encoder to solve the task; and (iii) unsupervised transfer assumes the least amount of available supervision, i.e., only the number of categories is given, and aims to uncover the underlying human labeling of a dataset.
  • Figure 2: TURTLE outperforms unsupervised baselines. Comparison of TURTLE to unsupervised baselines with respect to accuracy. All methods use the CLIP ViT-L/14 and DINOv2 representations. Bars represent the average performance with standard deviations computed over three runs.
  • Figure 3: TURTLE is an efficient method. Comparison of running time between TURTLE and unsupervised baselines. TURTLE employs an efficient first-order optimization procedure, achieving more than a 10$\times$ speedup compared to HUME. All methods use CLIP ViT-L/14 and DINOv2 representations. Bars represent the average performance over three runs. Standard deviations are negligible (Table \ref{tab:unsupervised_time}) and omitted for clarity.
  • Figure 4: TURTLE enables unsupervised transfer given representation spaces of foundation models. Employing the same CLIP representation space, TURTLE closely matches the performance of the corresponding CLIP zero-shot classifier on average over 26 datasets. With the use of an additional representation space, TURTLE outperforms zero-shot transfer, demonstrating exceptional abilities of unsupervised transfer learning.
  • Figure 5: TURTLE outperforms the CLIP zero-shot classifier on 15 out of 26 datasets. TURTLE is trained with CLIP ViT-L/14 and DINOv2 representations. CLIP zero-shot utilizes the same CLIP ViT-L/14 architecture. Furthermore, we observe that even with only a single CLIP representation space TURTLE outperforms CLIP on $13/26$ datasets (Figure \ref{fig:turtle_clip_zs}).
  • ...and 11 more figures
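
To make the zero-shot setting in the Figure 1 caption concrete, here is a minimal sketch of how class descriptions can be used once they have been embedded by a text encoder: each image is assigned to the class whose text embedding is most similar. It assumes cosine similarity as the score, as is usual for CLIP-style models. The embeddings are made-up 2-D vectors, not real model outputs, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the index of the class whose text embedding is most similar."""
    scores = [cosine(image_embedding, t) for t in class_text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (illustrative only): one text vector per class description.
class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.1], [0.2, 0.8]]

predictions = [zero_shot_predict(x, class_text_embeddings) for x in image_embeddings]
print(predictions)  # prints [0, 1]
```

The sketch highlights what unsupervised transfer removes: `class_text_embeddings` encodes human-provided category descriptions, whereas the unsupervised setting in the caption is given only the number of categories.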

Theorems & Definitions (7)

  • Proposition 3.1
  • Remark 3.2
  • Remark 3.3
  • Remark 3.4
  • Proposition 1.3
  • Proposition 1.4
  • proof