TrAct: Making First-layer Pre-Activations Trainable

Felix Petersen, Christian Borgelt, Stefano Ermon

TL;DR

This work proposes the conceptual procedure of a gradient descent step on first layer activations to construct an activation proposal, and finds the optimal weights of the first layer, i.e., those weights which minimize the squared distance to the activation proposal.

Abstract

We consider the training of the first layer of vision models and notice the clear relationship between pixel values and gradient update magnitudes: the gradients arriving at the weights of a first layer are by definition directly proportional to (normalized) input pixel values. Thus, an image with low contrast has a smaller impact on learning than an image with higher contrast, and a very bright or very dark image has a stronger impact on the weights than an image with moderate brightness. In this work, we propose performing gradient descent on the embeddings produced by the first layer of the model. However, switching to discrete inputs with an embedding layer is not a reasonable option for vision models. Thus, we propose the conceptual procedure of (i) a gradient descent step on first layer activations to construct an activation proposal, and (ii) finding the optimal weights of the first layer, i.e., those weights which minimize the squared distance to the activation proposal. We provide a closed form solution of the procedure and adjust it for robust stochastic training while computing everything efficiently. Empirically, we find that TrAct (Training Activations) speeds up training by factors between 1.25x and 4x while requiring only a small computational overhead. We demonstrate the utility of TrAct with different optimizers for a range of different vision models including convolutional and transformer architectures.
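The proportionality noted in the abstract can be checked numerically. The following is a minimal sketch, assuming a single linear first layer $z = W x$ and an upstream gradient that is held fixed while the input contrast changes (in a real network the upstream gradient would also change with the input). The function name and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def first_layer_weight_grad(grad_z, x):
    """Gradient of the loss w.r.t. W for z = W @ x, given dL/dz.

    By the chain rule, dL/dW = outer(dL/dz, x), so it scales linearly with x.
    """
    return np.outer(grad_z, x)


# Hypothetical normalized image patch (4 pixels) and upstream gradient (3 units).
x = rng.standard_normal(4)
grad_z = rng.standard_normal(3)

# Same patch at two contrast levels (assumed scaling factors).
low_contrast = 0.1 * x
high_contrast = 2.0 * x

g_low = first_layer_weight_grad(grad_z, low_contrast)
g_high = first_layer_weight_grad(grad_z, high_contrast)

# The ratio of gradient norms equals the ratio of contrasts (2.0 / 0.1),
# up to floating-point rounding.
ratio = np.linalg.norm(g_high) / np.linalg.norm(g_low)
print(ratio)
```

Under these assumptions the high-contrast patch moves the weights about twenty times as far as the low-contrast one for the same upstream signal, which is the imbalance the abstract describes.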

Paper Structure

This paper contains 22 sections, 4 theorems, 19 equations, 12 figures, 7 tables.

Key Result

Lemma 1

The solution $\Delta W_i^\star$ of Equation eq:problem-wi is available in closed form. The proof is deferred to Supplementary Material sm:theory.
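As a rough illustration of how a closed form of this kind can arise, the sketch below works through an assumed ridge-regularized least-squares objective. It uses the batch notation of the Figure 1 caption ($\Delta z = x^\top \cdot \Delta W$) rather than the per-sample index $i$ of the lemma. The objective, the step size $\eta$, and the normalization are assumptions made for illustration; the exact statement in Equation eq:problem-wi may differ.

```latex
% Assumed objective: match the activation proposal -\eta \nabla z
% while penalizing the size of the weight update.
\begin{align}
\Delta W^\star
  &= \arg\min_{\Delta W}\;
     \bigl\| x^\top \Delta W + \eta\, \nabla z \bigr\|_F^2
     + \lambda\, \bigl\| \Delta W \bigr\|_F^2 \\
% Setting the gradient w.r.t. \Delta W to zero:
0 &= x \bigl( x^\top \Delta W^\star + \eta\, \nabla z \bigr)
     + \lambda\, \Delta W^\star \\
% Solving the resulting linear system:
\Delta W^\star
  &= -\eta\, \bigl( x\, x^\top + \lambda\, I \bigr)^{-1} x\, \nabla z \\
% Resulting change of the pre-activations:
\Delta z
  &= x^\top \Delta W^\star
   = -\eta\, x^\top \bigl( x\, x^\top + \lambda\, I \bigr)^{-1} x\, \nabla z
\end{align}
```

For $\lambda > 0$ the assumed objective is strictly convex, so the stationary point is its unique minimizer. The factor $(x\, x^\top + \lambda\, I)^{-1}$ has the same form as the corrective term named in the Figure 1 caption.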

Figures (12)

  • Figure 1: TrAct learns the first layer of a vision model but with the training dynamics of an embedding layer. We illustrate this in an example with two 4-dimensional inputs $x$, a weight matrix $W$ of size $4\times 3$, and resulting pre-activations $z$ of size $2\times3$. For language models (left), the input $x$ is two tokens from a dictionary of size 4. For vision models (center + right), the input $x$ is two patches of the image, each totaling 4 pixels. During backpropagation, we obtain the gradient w.r.t. our pre-activations $\nabla z$, from which the gradient and update to the weights $W$ are computed ($\Delta W$). The resulting update to the pre-activations $\Delta z$ equals $x^\top\cdot \Delta W$. For language models (left), $\Delta z=\nabla z$, i.e., the training dynamics of the embedding layer correspond to updating the embeddings directly w.r.t. the gradient. Specifically, the update in a language model, for a token identifier $i$, is $W_i \leftarrow W_i - \eta \cdot \nabla_{z} \mathcal{L}(z)$ where $z = W_i$ is the activation of the first layer and at the same time the $i$th row of the embedding (weight) matrix $W$. Equivalently, we can write $z \leftarrow z - \eta \cdot \nabla_{z} \mathcal{L}(z)$. However, in vision models (center), the update $\Delta z$ strongly deviates from the respective gradients $\nabla z$. TrAct corrects for this by adjusting $\Delta W$ via a corrective term $(x \cdot x^\top + \lambda\cdot I)^{-1}$ (orange box), such that the update to $z$ closely approximates $\nabla z$.
  • Figure 2: Implementation of TrAct, where l corresponds to the hyperparameter $\lambda$.
  • Figure 3: Training a ResNet-18 on CIFAR-10. We train for $\{100,200,400,800\}$ epochs using a cosine learning rate schedule and with SGD (left) and Adam (right). Learning rates have been selected as optimal for each baseline. Averaged over 5 seeds. TrAct (solid lines) consistently outperforms the baselines (dashed), in many cases with only a quarter of the baseline's epochs.
  • Figure 4: Training a ViT on CIFAR-10. We train for $\{100,200,400,800\}$ epochs using a cosine learning rate schedule and with Adam. Learning rates have been selected as optimal for each baseline. Averaged over 5 seeds.
  • Figure 5: Test accuracy of ResNet-50 trained on ImageNet for $\{30, 60, 90\}$ epochs. When training for $60$ epochs with TrAct, we achieve comparable accuracy to standard training for $90$ epochs, showing a $1.5\times$ speedup. Plots for ResNet-18/34 are in the SM.
  • ...and 7 more figures
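The toy setting of Figure 1 (two 4-dimensional inputs, $W$ of size $4\times 3$, pre-activations of size $2\times 3$) can be reproduced in a few lines. This is a minimal sketch, not the implementation shown in Figure 2: the function names, the example pixel values, the learning rate, and the very small $\lambda$ are all assumptions chosen so that the effect of the corrective term $(x \cdot x^\top + \lambda\cdot I)^{-1}$ is easy to see.

```python
import numpy as np


def standard_update(x, grad_z, lr):
    """Plain gradient step on W. x: (n, b) inputs as columns, grad_z: (b, m)."""
    return -lr * x @ grad_z


def corrected_update(x, grad_z, lr, lam):
    """Step on W with the corrective term (x x^T + lam I)^{-1} applied.

    Solves (x x^T + lam I) dW = -lr * x @ grad_z instead of forming the inverse.
    """
    n = x.shape[0]
    return np.linalg.solve(x @ x.T + lam * np.eye(n), -lr * x @ grad_z)


lr, lam = 0.1, 1e-6  # assumed values; lam is tiny only to make the match visible
grad_z = np.array([[1.0, -2.0, 0.5],
                   [-0.5, 1.0, 2.0]])  # hypothetical gradient w.r.t. z, shape (2, 3)

# Language-model case: two distinct one-hot tokens from a dictionary of size 4.
x_tok = np.zeros((4, 2))
x_tok[0, 0] = 1.0
x_tok[2, 1] = 1.0
dz_tok = x_tok.T @ standard_update(x_tok, grad_z, lr)

# Vision case: two patches with 4 pixels each (hypothetical pixel values).
x_img = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.7, 0.4],
                  [0.6, 0.3]])
dz_plain = x_img.T @ standard_update(x_img, grad_z, lr)
dz_corr = x_img.T @ corrected_update(x_img, grad_z, lr, lam)

target = -lr * grad_z  # a direct gradient step on the pre-activations
print("token update matches target:    ", np.allclose(dz_tok, target))
print("plain vision update matches:    ", np.allclose(dz_plain, target, atol=1e-4))
print("corrected vision update matches:", np.allclose(dz_corr, target, atol=1e-4))
```

In this sketch the one-hot inputs already give $\Delta z$ equal to a direct gradient step, the plain vision update does not, and the corrected update recovers it up to a small error controlled by $\lambda$. A larger $\lambda$ trades this exactness for a better-conditioned linear system.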

Theorems & Definitions (6)

  • Lemma 1
  • Lemma 2
  • Lemma 1
  • proof
  • Lemma 2
  • proof