Let Go of Your Labels with Unsupervised Transfer
Artyom Gadetsky, Yulun Jiang, Maria Brbic
TL;DR
The paper introduces TURTLE, a fully unsupervised method that discovers the labeling of a dataset by searching for labels that induce maximal-margin classifiers in the representation spaces of multiple foundation models, enabling transfer without any supervision. TURTLE builds on a generalization-based objective that favors max-margin solutions, and it solves a bilevel optimization problem over multiple representation spaces using entropy-based regularization and alternating training. Empirically, TURTLE achieves state-of-the-art unsupervised performance on 26 vision datasets and, with two representation spaces, often surpasses CLIP zero-shot transfer while remaining competitive with supervised linear probes. The results demonstrate the strength of unsupervised transfer and suggest that combining representations from multiple models can improve downstream labeling without labeled data.
Abstract
Foundation vision-language models have enabled remarkable zero-shot transferability of pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still requires human guidance to define the visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in the representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision or task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, despite being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, when employing the same representation space, TURTLE matches the average performance of CLIP zero-shot on 26 datasets across a wide range of architectures and model sizes. By guiding the search for the underlying labeling using the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
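To make the alternating, bilevel idea summarized above more concrete, the following is a minimal toy sketch in NumPy. It is not the authors' implementation. Several choices are assumptions made purely for illustration: the function name `toy_unsupervised_labeling`, the hyperparameters (`inner_steps`, `gamma`, `lr`), the labeling parameterized as a softmax over summed linear heads, a first-order outer step that holds the classifiers fixed, and synthetic Gaussian features standing in for the representation spaces of real foundation models.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def toy_unsupervised_labeling(spaces, n_classes, outer_steps=300,
                              inner_steps=10, lr=0.1, gamma=10.0, seed=0):
    """Alternate between fitting per-space linear classifiers to the current
    soft labeling (inner loop) and updating the labeling itself (outer step).

    spaces: list of (n_samples, dim_k) arrays, one per representation space.
    Returns an (n_samples, n_classes) array of soft label assignments.
    """
    rng = np.random.default_rng(seed)
    n = spaces[0].shape[0]
    eps = 1e-12
    # Labeling parameters (assumed form): one linear head per space,
    # with the per-space logits summed before the softmax.
    thetas = [0.1 * rng.standard_normal((z.shape[1], n_classes)) for z in spaces]
    # Inner linear classifiers, warm-started across outer steps.
    ws = [np.zeros((z.shape[1], n_classes)) for z in spaces]

    for _ in range(outer_steps):
        tau = softmax(sum(z @ th for z, th in zip(spaces, thetas)))

        # Inner loop: a few cross-entropy gradient steps per space,
        # treating the current labeling tau as fixed targets.
        for k, z in enumerate(spaces):
            for _ in range(inner_steps):
                ws[k] -= lr * z.T @ (softmax(z @ ws[k]) - tau) / n

        # Outer step (first-order approximation: classifiers held fixed).
        # Loss = sum_k CE(tau, p_k) - gamma * H(mean of tau over samples).
        grad_tau = np.zeros_like(tau)
        for k, z in enumerate(spaces):
            grad_tau -= np.log(softmax(z @ ws[k]) + eps)
        tau_bar = tau.mean(axis=0)
        grad_tau += gamma * (np.log(tau_bar + eps) + 1.0)
        grad_tau /= n
        # Backpropagate through the softmax that produced tau.
        grad_logits = tau * (grad_tau - (tau * grad_tau).sum(axis=1, keepdims=True))
        for k, z in enumerate(spaces):
            thetas[k] -= lr * z.T @ grad_logits

    return softmax(sum(z @ th for z, th in zip(spaces, thetas)))


# Usage on synthetic data: two "representation spaces" of the same 200 samples.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0]])
hidden = rng.integers(0, 2, size=200)  # used only to generate the data
space_a = centers[hidden] + rng.standard_normal((200, 2))
space_b = centers[hidden][:, ::-1] + rng.standard_normal((200, 2))
tau = toy_unsupervised_labeling([space_a, space_b], n_classes=2)
labels = tau.argmax(axis=1)
```

The entropy term on the average assignment is what discourages the trivial solution of placing every sample in one class. The sketch makes no claim about how well the recovered `labels` match the hidden ones, and an optimizer other than plain gradient descent may be needed for anything beyond this toy setting.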
