Accelerating Augmentation Invariance Pretraining

Jinhong Lin, Cheng-En Wu, Yibing Wei, Pedro Morgado

TL;DR

This work proposes an acceleration framework for ViT, leveraging ViT's unique ability to generalize across inputs of varying sequence lengths, and employs a mix of sequence compression strategies to reduce the cost of gradient estimation and accelerate convergence.

Abstract

Our work tackles the computational challenges of contrastive learning methods, particularly for the pretraining of Vision Transformers (ViTs). Despite the effectiveness of contrastive learning, the substantial computational resources required for training often hinder its practical application. To mitigate this issue, we propose an acceleration framework that leverages ViT's unique ability to generalize across inputs of varying sequence lengths. Our method employs a mix of sequence compression strategies, including randomized token dropout and flexible patch scaling, to reduce the cost of gradient estimation and accelerate convergence. We further provide an in-depth analysis of the gradient estimation error of various acceleration strategies as well as their impact on downstream tasks, offering valuable insights into the trade-offs between acceleration and performance. We also propose a novel procedure to identify an optimal acceleration schedule that adjusts the sequence compression ratios to the training progress, ensuring efficient training without sacrificing downstream performance. Our approach significantly reduces computational overhead across various self-supervised learning algorithms on large-scale datasets. On ImageNet, our method achieves speedups of 4$\times$ in MoCo, 3.3$\times$ in SimCLR, and 2.5$\times$ in DINO, demonstrating substantial efficiency gains.
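
To make the two sequence compression strategies named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names `num_tokens` and `random_token_dropout`, the 224-pixel input, and the patch sizes are illustrative assumptions; the sketch only shows how each strategy shortens the token sequence a ViT has to process.

```python
import numpy as np

def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length when a square image is cut into non-overlapping square patches."""
    assert image_size % patch_size == 0, "patch size must divide image size"
    return (image_size // patch_size) ** 2

def random_token_dropout(tokens: np.ndarray, drop_ratio: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Keep a random subset of tokens for each sample.

    tokens: array of shape (batch, seq_len, dim).
    drop_ratio: fraction of tokens to discard (hypothetical parameter name).
    """
    batch, seq_len, _ = tokens.shape
    n_keep = max(1, int(round(seq_len * (1.0 - drop_ratio))))
    # Draw an independent random ordering per sample and keep the first n_keep
    # indices; sorting them preserves the original token order.
    noise = rng.random((batch, seq_len))
    keep_idx = np.sort(np.argsort(noise, axis=1)[:, :n_keep], axis=1)
    return np.take_along_axis(tokens, keep_idx[:, :, None], axis=1)

# Patch scaling: larger patches give a shorter sequence for the same image.
base_len = num_tokens(224, 16)    # 196 tokens
scaled_len = num_tokens(224, 32)  # 49 tokens

# Token dropout: discard half of the 196 tokens at random.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, base_len, 8))
kept = random_token_dropout(tokens, drop_ratio=0.5, rng=rng)  # shape (2, 98, 8)
```

Both strategies reduce the sequence length, so they can be combined; how the patch embedding itself is adapted to a different patch size is not covered by this sketch.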

Paper Structure

This paper contains 36 sections, 9 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Our accelerated MoCo-v3 achieves standard MoCo-v3 performance using only 1/5 of the training budget on ImageNet-100 and 1/3 on ImageNet-1k. The training budget (x-axis) is measured as the training time normalized by the forward pass of the base non-accelerated backbone model, in million (M) units. Results for ImageNet-100 and ImageNet-1k are shown in separate panels.
  • Figure 1: Hardware-independent sample cost of different pre-training algorithms. We assume relatively short sequence lengths (typical of pre-training frameworks) where linear operations dominate over the quadratic self-attention operations.
  • Figure 2: Framework overview. We propose a method for accelerating augmentation invariance pre-training of transformer neural networks. Acceleration is achieved by compressing the ViT's input sequence length using two strategies: (1) randomized token dropout and (2) flexible patch scaling. We further introduce a gradient error analysis framework to assess the efficacy of an acceleration strategy, enabling us to define an optimal acceleration schedule that adjusts to the training progress. The acceleration strategy can be applied to a variety of methods. For example, SimCLR optimizes both encoders by gradient descent, while MoCo and DINO use a momentum encoder to compute the representations for the Key view. The loss function also differs across algorithms.
  • Figure 3: Accelerated MoCo-v3 sample costs for varying dropout rates and patch sizes. We assume uncompressed key sequences.
  • Figure 4: Error profile of accelerated gradients. From top to bottom, the three panels show the CA-MSE, squared bias and cost-adjusted variance of the gradient estimates, using different acceleration strategies and at different stages of training.
  • ...and 4 more figures
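
The cost captions above can be illustrated with a rough back-of-the-envelope model. The sketch below is an assumption-laden illustration, not the paper's cost formula: it takes cost to be proportional to sequence length (the short-sequence regime mentioned in the caption, where linear operations dominate), assumes a backward pass costs about twice a forward pass, and assumes a MoCo-style setup in which the query encoder runs forward and backward on a compressed sequence while the momentum (key) encoder runs forward only on the uncompressed sequence. The names `relative_seq_len` and `moco_style_cost` are hypothetical.

```python
def relative_seq_len(drop_ratio: float, patch_size: int, base_patch_size: int = 16) -> float:
    """Sequence length relative to the uncompressed input.

    Token dropout keeps a (1 - drop_ratio) fraction of tokens; scaling the
    patch side by a factor s divides the number of patches by s**2.
    """
    return (1.0 - drop_ratio) * (base_patch_size / patch_size) ** 2

def moco_style_cost(query_ratio: float, backward_factor: float = 2.0) -> float:
    """Per-sample cost in units of one uncompressed forward pass.

    Assumptions (not taken from the paper): cost is linear in sequence length,
    backward costs `backward_factor` times a forward pass, and the key encoder
    processes the uncompressed sequence with a forward pass only.
    """
    query_cost = (1.0 + backward_factor) * query_ratio  # forward + backward, compressed
    key_cost = 1.0                                      # forward only, uncompressed
    return query_cost + key_cost

baseline = moco_style_cost(relative_seq_len(0.0, 16))     # no compression
accelerated = moco_style_cost(relative_seq_len(0.0, 32))  # 2x larger patches
speedup = baseline / accelerated
```

Under these assumptions the uncompressed key branch puts a floor on the per-sample cost, so the achievable speedup saturates as the query sequence shrinks.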