Contrastive Forward-Forward: A Training Algorithm of Vision Transformer

Hossein Aghagolzadeh, Mehdi Ezoji

TL;DR

This work targets the limited biological plausibility and training efficiency of backpropagation by extending the Forward-Forward (FF) algorithm to Vision Transformers (ViT), yielding Contrastive Forward-Forward (CFF). CFF replaces FF's local losses with a supervised contrastive objective and adopts a two-branch data flow akin to contrastive learning. Compared with baseline FF, it reaches higher accuracy and converges much faster, and it remains competitive with backpropagation under various conditions, including inaccurate supervision. A Marginal Contrastive Loss is introduced to progressively tighten same-class representations across layers. The approach is validated on ViT architectures across multiple datasets, with notable gains in convergence speed and inference efficiency. The results demonstrate the practical potential of brain-inspired, layer-wise, contrastive training for large-scale vision models and highlight opportunities for parallelization, robustness, and applicability beyond simple architectures.

Abstract

Although backpropagation is widely accepted as the training algorithm for artificial neural networks, researchers continue to look to the brain for inspiration in search of alternatives with potentially better performance. Forward-Forward is a novel training algorithm that more closely resembles what occurs in the brain, although it shows a significant performance gap compared to backpropagation. In the Forward-Forward algorithm, a loss function is placed after each layer, and each layer is updated using two local forward passes and one local backward pass. Forward-Forward is in its early stages and has been designed and evaluated on simple multi-layer perceptron networks for image classification tasks. In this work, we extend the algorithm to a more complex and modern network, namely the Vision Transformer. Inspired by insights from contrastive learning, we revise the algorithm and introduce Contrastive Forward-Forward. Experimental results show that the proposed algorithm performs significantly better than the baseline Forward-Forward, increasing accuracy by up to 10% and accelerating convergence by 5 to 20 times. Furthermore, taking Cross Entropy as the baseline loss function in backpropagation, we show that the proposed modifications to the baseline Forward-Forward reduce its performance gap compared to backpropagation on the Vision Transformer, and even outperform it in certain conditions, such as inaccurate supervision.
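To make the contrastive objective mentioned above more concrete, the following is a minimal sketch of a supervised contrastive loss in the style of Khosla et al. (2020), written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `supcon_loss`, the `temperature` default, and the choice to average over anchors that have at least one positive are all assumptions, and the margin term of the proposed Marginal Contrastive Loss is not modeled here. In a layer-wise setting, `features` would presumably be one layer's pooled outputs for the samples of both branches.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    features: list of equal-length float lists, one per sample.
    labels:   list of class labels, one per sample.
    The temperature default is an assumption, not a value from the paper.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch that share the anchor's label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled similarities between the anchor and every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supcon_loss(feats, [0, 0, 1, 1]))  # labels agree with the geometry
    print(supcon_loss(feats, [0, 1, 0, 1]))  # labels contradict the geometry
```

The loss is small when samples with the same label already lie close together and large when the labels contradict the geometry, which is the signal a locally trained layer would use to pull same-class representations together.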

Paper Structure

This paper contains 38 sections, 31 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Abstract illustrations of backpropagation (Rumelhart et al., 1986) with the Cross-Entropy loss function, baseline Forward-Forward (Hinton, 2022), Supervised Contrastive Learning (Khosla et al., 2020), and the proposed method, Contrastive Forward-Forward.
  • Figure 2: Label Representation Strategies.
  • Figure 3: An illustration of the proposed training algorithm applied to ViT (Dosovitskiy et al., 2020). Each "Encoder Layer" is considered a layer of the network. For each layer, there are two local forward passes and one local backward pass. There is only one encoder in "Stage 1"; the bottom network is drawn only to illustrate the algorithm more clearly, and the weights of corresponding layers in the top and bottom networks are shared.
  • Figure 4: Margin tuning of the proposed loss function on the validation set. A: (ViT[128 4 5], CIFAR10, RC), B: (ViT[192 6 6], CIFAR10, RC), C: (ViT[192 6 6], CIFAR10, RC + RA), D: (ViT[192 6 6], CIFAR100, RC + RA), and E: (ViT[240 6 7], CIFAR100, RC + RA). NM: No Margin, RC: Random Crop, RA: RandAug (Cubuk et al., 2020).
  • Figure 5: A layer-wise analysis of the training process of the proposed method (CFF+M) with ViT[128 4 5] on CIFAR-10. The first to fifth columns correspond to the first to fifth layers of the network, respectively. Row 1: The value of the Fisher criterion, computed after training on the representations output by each layer for the validation data. Row 2: 2D t-SNE visualizations of the representations output by each layer for the validation data. Row 3: The right vertical axis shows the loss, and the left vertical axis shows the percentage of samples from $P(k)$ that fall into $R(k)$. The summation is performed over each possible $k$, which includes the indices of all samples in the batch. The values of both charts are averaged over ten trials. Row 4: The cosine similarity between an arbitrary anchor and two samples, one positive and one negative.
  • ...and 3 more figures
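
The Figure 5 caption refers to two quantities that can be illustrated compactly: the Fisher criterion (Row 1) and cosine similarity (Row 4). The sketch below is a hedged illustration only. The paper's exact definition of the Fisher criterion is not given in this excerpt, so the common trace-ratio form (between-class scatter divided by within-class scatter) is assumed, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fisher_criterion(features, labels):
    """Between-class scatter divided by within-class scatter (trace form).

    Assumed definition: larger values mean classes are compact and far apart.
    """
    dim = len(features[0])
    n = len(features)
    global_mean = [sum(f[d] for f in features) / n for d in range(dim)]
    between, within = 0.0, 0.0
    for cls in set(labels):
        members = [f for f, y in zip(features, labels) if y == cls]
        class_mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Class-size-weighted squared distance of the class mean to the global mean.
        between += len(members) * sum(
            (m - g) ** 2 for m, g in zip(class_mean, global_mean)
        )
        # Squared distances of the class members to their own class mean.
        within += sum(
            sum((x - m) ** 2 for x, m in zip(f, class_mean)) for f in members
        )
    return float("inf") if within == 0.0 else between / within


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    tight = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
    loose = [[0.0, 0.0], [0.0, 3.0], [5.0, 5.0], [5.0, 8.0]]
    print(fisher_criterion(tight, labels), fisher_criterion(loose, labels))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Under this assumed definition, a layer whose same-class representations are tighter while the class means stay equally far apart scores higher, which is the kind of layer-by-layer trend Row 1 of Figure 5 is described as tracking.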