No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations

Walter Simoncini, Spyros Gidaris, Andrei Bursuc, Yuki M. Asano

TL;DR

FUNGI features can benefit linear classification, clustering and image retrieval, and they significantly improve the retrieval-based in-context scene understanding abilities of pretrained models, for example improving upon DINO by +17% for semantic segmentation, all without any training.

Abstract

This paper introduces FUNGI, Features from UNsupervised GradIents, a method to enhance the features of transformer encoders by leveraging self-supervised gradients. Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input. These gradients are projected to a lower dimension and then concatenated with the model's output embedding. The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio. Across backbones spanning various sizes and pretraining strategies, FUNGI features provide consistent performance improvements over the embeddings. We also show that using FUNGI features can benefit linear classification, clustering and image retrieval, and that they significantly improve the retrieval-based in-context scene understanding abilities of pretrained models, for example improving upon DINO by +17% for semantic segmentation - without any training.
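The pipeline described in the abstract (per-input gradient, projection to a lower dimension, concatenation with the embedding) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the "backbone" is a single frozen linear layer `W`, the self-supervised objective is a KL divergence between a uniform distribution and the softmax of the output (chosen because its gradient has a simple closed form), the projection is a random sign matrix, and both parts are L2-normalized before concatenation. All function names are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against a zero vector
    return [x / norm for x in v]

def embed(W, x):
    # Toy frozen "backbone": a single linear layer z = W x (assumption).
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def kl_gradient(W, x):
    # Per-sample gradient of KL(uniform || softmax(W x)) with respect to W.
    # Closed form: dL/dW = (p - u) x^T, flattened row by row.
    p = softmax(embed(W, x))
    u = 1.0 / len(p)
    return [(pi - u) * xi for pi in p for xi in x]

def make_projection(in_dim, out_dim, seed=0):
    # Fixed random sign matrix used to shrink the gradient (assumption).
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.choice((-scale, scale)) for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(P, g):
    return [sum(p * gi for p, gi in zip(row, g)) for row in P]

def fungi_like_features(W, P, x):
    # Concatenate the normalized embedding with the normalized projected gradient.
    z = l2_normalize(embed(W, x))
    g = l2_normalize(project(P, kl_gradient(W, x)))
    return z + g

rng = random.Random(1)
d_in, d_out = 6, 4
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
P = make_projection(d_in * d_out, d_out)  # project 24 gradient entries down to 4
x = [rng.gauss(0, 1) for _ in range(d_in)]
feats = fungi_like_features(W, P, x)
print(len(feats))  # 8 = 4 embedding dims + 4 projected-gradient dims
```

The resulting vectors could then be placed in a nearest-neighbor index in place of the plain embeddings. With a real backbone the gradient would come from automatic differentiation rather than a closed form.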

Paper Structure

This paper contains 48 sections, 8 equations, 12 figures, 27 tables, 1 algorithm.

Figures (12)

  • Figure 1: Gradient-augmented features: given a pretrained backbone $f_{\theta^*}$ and its embeddings, we apply a family of SSL losses, extract their gradients, and project and concatenate them. These new features are used to build a $k$-nearest neighbor index, which can be used for classification or retrieval.
  • Figure 2: Combining diverse features (embeddings, gradients) leads to large improvements. Pairwise CKA similarity of features (top) and the kNN accuracy of their combination (bottom).
  • Figure 3: Gradients encode different information. Delta in per-class kNN accuracy of gradients from different objectives compared to the embeddings, indicated as "Emb." in the plot.
  • Figure 4: Gradient extraction using a SimCLR loss. Given a pretrained backbone $f$ and a randomly initialized projection head $h$, we (1) patchify an image and obtain the latent representations of the patches, (2) compute the SimCLR loss, which maximizes the pairwise cosine similarity of the patches and minimizes their similarity to a fixed batch of negatives, and backpropagate, (3) extract the per-sample gradients, and (4) project the gradients to the same dimensionality as the embeddings.
  • Figure 5: Better data-efficiency. kNN accuracy of embeddings and FUNGI features (using only KL and SimCLR gradients) on ImageNet-100 using a DeiT-B/16 backbone when only $k$ shots are used.
  • ...and 7 more figures
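
The pairwise CKA similarity mentioned in the Figure 2 caption can be sketched as follows. This is a minimal sketch of the linear variant of CKA, computed on column-centered feature matrices as $\|X^\top Y\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$. Whether the paper uses the linear variant or a kernel variant is not stated in this section, so that choice is an assumption, as are the function names and the toy matrices.

```python
import math

def center_columns(X):
    # Subtract each feature's mean over the samples (rows).
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(X, Y):
    # Squared Frobenius norm of X^T Y, accumulated column pair by column pair.
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total

def linear_cka(X, Y):
    # X and Y hold one row per sample; their feature dimensions may differ.
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_gram_sq_norm(Xc, Yc)
    den = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return num / den

# Toy feature matrices: 4 samples, 2 features each (illustrative values).
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
Y = [[2.0 * v for v in row] for row in X]  # rescaled copy of X
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

same = linear_cka(X, Y)   # rescaling leaves CKA at 1 (up to rounding)
other = linear_cka(X, Z)  # a value between 0 and 1
```

A low similarity between two feature sets suggests they encode different information, which is the situation in which the caption reports that combining them helps.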