
Learning from Offline Foundation Features with Tensor Augmentations

Emir Konuk, Christos Matsoukas, Moein Sorkhei, Phitchapha Lertsiravaramet, Kevin Smith

TL;DR

LOFF-TA makes it possible to leverage the power of foundation models, regardless of their size, in settings with limited computational capacity, and can be used to apply foundation models to high-resolution images without increasing compute.

Abstract

We introduce Learning from Offline Foundation Features with Tensor Augmentations (LOFF-TA), an efficient training scheme designed to harness the capabilities of foundation models in limited-resource settings where their direct development is not feasible. LOFF-TA involves training a compact classifier on cached feature embeddings from a frozen foundation model, resulting in up to $37\times$ faster training and up to $26\times$ lower GPU memory usage. Because the embeddings of augmented images would be too numerous to store, yet the augmentation process is essential for training, we propose to apply tensor augmentations to the cached embeddings of the original non-augmented images. LOFF-TA makes it possible to leverage the power of foundation models, regardless of their size, in settings with limited computational capacity. Moreover, LOFF-TA can be used to apply foundation models to high-resolution images without increasing compute. In certain scenarios, we find that training with LOFF-TA yields better results than directly fine-tuning the foundation model.

Paper Structure (21 sections, 4 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Learning from Offline Foundation Features with Tensor Augmentations (LOFF-TA). Training data is passed through a foundation model and cached. The cached embeddings are loaded and spatial tensor augmentations are applied in lieu of standard image augmentations. A lightweight classifier is trained on the cached, augmented features. This enables the use of arbitrarily large foundation models and high-resolution images at no additional cost.
  • Figure 2: Overview of LOFF-TA. Step 1: We leverage a foundation model to process the training data and store the extracted features offline. Step 2: The cached tensors are loaded, tensor augmentations are applied, then the augmented tensors are passed through projection and normalization layers and used to train a lightweight classifier. The tensor augmentations include spatial-based transforms, such as flips and crops, along with additive Gaussian noise. An optional pooling step (dashed operation) reduces the spatial dimension of the stored features, allowing for training with high-resolution images at no additional cost.
  • Figure 3: CKA similarities between different models. Left: Representation similarity of different classifiers after fine-tuning on Oxford-IIIT Pet. Right: Representation similarity of the internal layers of each classifier with itself before and after fine-tuning.
  • Figure 4: Robustness and spatial consistency of features. Images along with a random channel of the corresponding foundation features reveal the spatial consistency between objects in the image and feature spaces. This consistency allows insights from the image space to guide tensor augmentation choice, e.g. if vertical flips are harmful for a building facade dataset in image space, they are likely to be harmful in feature space. However, we observe that training with LOFF-TA is more robust against 'incorrect' augmentation choices compared to standard classifier training on images.
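
The tensor augmentations and optional pooling step described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `augment_features` and `avg_pool`, the `(H, W, C)` layout of a cached feature map, and the parameters `crop_frac`, `noise_std` and `flip_prob` are all hypothetical. The crop here returns a smaller map; how the cropped tensor is brought back to a fixed size before the classifier is not covered.

```python
import numpy as np


def augment_features(feat, rng, crop_frac=0.75, noise_std=0.1, flip_prob=0.5):
    """Augment one cached feature map of shape (H, W, C).

    Hypothetical sketch: flip, random crop, then additive Gaussian noise,
    mirroring the augmentation types named in the Figure 2 caption.
    """
    h, w, _ = feat.shape
    out = feat

    # Horizontal flip: reverse the width axis with probability flip_prob.
    if rng.random() < flip_prob:
        out = out[:, ::-1, :]

    # Random crop: keep a crop_frac fraction of each spatial dimension.
    ch = max(1, int(round(h * crop_frac)))
    cw = max(1, int(round(w * crop_frac)))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]

    # Additive Gaussian noise applied element-wise (returns a new array).
    return out + rng.normal(0.0, noise_std, size=out.shape)


def avg_pool(feat, k):
    """Average-pool an (H, W, C) feature map with a k x k window.

    Rows/columns that do not fill a complete window are dropped.
    """
    h, w, c = feat.shape
    h2, w2 = h // k, w // k
    trimmed = feat[:h2 * k, :w2 * k, :]
    return trimmed.reshape(h2, k, w2, k, c).mean(axis=(1, 3))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a cached embedding; a real one would be loaded from disk.
    cached = rng.normal(size=(16, 16, 8))
    print(augment_features(cached, rng).shape)  # (12, 12, 8)
    print(avg_pool(cached, 2).shape)            # (8, 8, 8)
```

Because the augmentations operate on the stored tensor rather than on the image, the foundation model is never re-run during training; pooling before caching would also shrink what has to be stored, which is presumably how higher-resolution inputs stay affordable.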