GSB: Group Superposition Binarization for Vision Transformer with Limited Training Samples

Tian Gao, Cheng-Zhong Xu, Le Zhang, Hui Kong

TL;DR

This work tackles the overfitting and high resource demands of Vision Transformers when training data are scarce by introducing Group Superposition Binarization (GSB), a 1-bit binarization scheme that preserves information in the attention and value matrices via grouped masks and learnable scales. It sets up a baseline binarization method for linear layers and attention, analyzes why ViT binarization underperforms relative to CNN/MLP binarization, and develops GSB, which represents attention and value as a linear combination of binarized components trained with STE-guided gradients. The method is complemented by a two-stage training regimen and hard-label distillation from a pretrained teacher, which yields strong performance on small datasets (CIFAR-100, Oxford-Flowers102, Chaoyang) while drastically reducing computational cost. Empirical results show substantial accuracy gains over baselines, robustness to label noise, and major efficiency advantages, suggesting practical viability for edge devices and other constrained environments. Overall, GSB provides a principled, scalable route to deploying binarized ViTs without sacrificing performance on limited-data tasks.

Abstract

Vision Transformer (ViT) has performed remarkably in various computer vision tasks. Nonetheless, because of its massive number of parameters, ViT usually suffers from serious overfitting when the number of training samples is relatively limited. In addition, ViT generally demands heavy computing resources, which limits its deployment on resource-constrained devices. As a type of model-compression method, model binarization is potentially a good choice to solve the above problems. Compared with its full-precision counterpart, a binarized model replaces complex tensor multiplication with simple bit-wise binary operations and represents full-precision model parameters and activations with only 1-bit ones, which potentially addresses computational complexity and model size, respectively. In this paper, we investigate a binarized ViT model. Empirically, we observe that existing binarization techniques designed for Convolutional Neural Networks (CNN) do not transfer well to the ViT binarization task. We also find that the accuracy decline of the binary ViT model is mainly due to the information loss of the Attention module and the Value vector. Therefore, we propose a novel model binarization technique, called Group Superposition Binarization (GSB), to deal with these issues. Furthermore, to further improve the performance of the binarized model, we have investigated the gradient calculation procedure in the binarization process and derived more suitable gradient calculation equations for GSB to reduce the influence of gradient mismatch. Then, the knowledge distillation technique is introduced to alleviate the performance degradation caused by model binarization. Analytically, model binarization can limit the parameter search space during parameter updates while training a model....
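
The abstract's claim that binarization replaces tensor multiplication with bit-wise operations can be illustrated with a small sketch. It assumes values are binarized to {-1, +1} with the sign function (zeros mapped to +1) and packed into integer bitmasks; the helper names `sign_binarize`, `pack_bits`, and `xnor_popcount_dot` are hypothetical and are not taken from the paper.

```python
def sign_binarize(values):
    """Map real values to {-1, +1}; zero is mapped to +1 (an assumption)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Encode a {-1, +1} vector as an integer bitmask: +1 -> 1, -1 -> 0."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    xnor marks the positions where the signs agree; each agreement
    contributes +1 and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    matches = bin(xnor).count("1")  # popcount
    return 2 * matches - n


w = sign_binarize([0.7, -0.2, 0.1, 1.3])    # [1, -1, 1, 1]
a = sign_binarize([0.4, 0.9, -0.5, 0.0])    # [1, 1, -1, 1]
reference = sum(wi * ai for wi, ai in zip(w, a))
fast = xnor_popcount_dot(pack_bits(w), pack_bits(a), len(w))
assert fast == reference
```

The bit-wise path and the ordinary multiply-accumulate path agree on the binarized vectors; the saving comes from handling many elements per machine word instead of one multiplication per element.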

Paper Structure (18 sections, 37 equations, 17 figures, 9 tables)

Figures (17)

  • Figure 1: An illustrative example of how model binarization reduces overfitting. The example is a quadratic model whose parameters $a$, $b$, and $c$ are binarized. Before binarization, the model can represent infinitely many functions; after binarization, it can represent only eight. Binarization therefore limits the parameter search space during training, which acts as an implicit regularizer and may help mitigate overfitting when training data are insufficient [santos2022avoiding, ying2019overview, rice2020overfitting, wan2013regularization, srivastava2014dropout].
  • Figure 2: The binarization process of the linear layer. $\mathbf{Y}\in \mathbb{R}^{d\times n}$ denotes the output of the linear layer. $\mathbf{W}\in \mathbb{R}^{d\times b}$ and $\mathbf{A}\in \mathbb{R}^{b\times n}$ denote the full-precision weight and activation, respectively. We take $n=4$, $d=5$, and $b=4$ as an example. The batch size is 1. For the $xnor$ and $popcount$ operations with 0-valued vectors, see [qin2022bibert].
  • Figure 3: The bitwise $xnor$ and $popcount$ operation.
  • Figure 4: The distribution of the values of the attention matrix closest to the output layer (the layer that outputs class probabilities). The model used is our baseline at initialization. (a) The distribution of the values of $\mathbf{A_{re}}$. (b) The distribution of the values of the binarized attention matrix after scaling by the factor defined in Eq. \ref{eq1-2}.
  • Figure 5: Performance comparison on the CIFAR-100 dataset after binarizing different modules of ViT. The horizontal dashed line marks the performance of the full-precision model. All compared models are trained from scratch on the dataset for 600 epochs. The base model is DeiT-small [touvron2021training] without distillation, which can be seen as a ViT with strong data augmentation.
  • ...and 12 more figures
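
The counting argument in the Figure 1 caption can be checked with a short sketch. It assumes the quadratic has the form $a x^2 + b x + c$ and that each binarized parameter takes a value in {-1, +1}; neither detail is stated in the caption, and the helper `best_binary_fit` is hypothetical.

```python
from itertools import product


def quadratic(a, b, c):
    """Return the function x -> a*x^2 + b*x + c."""
    return lambda x: a * x * x + b * x + c


# Assumption: each binarized parameter takes a value in {-1, +1}.
candidates = list(product((-1, 1), repeat=3))

# Three sample points determine a quadratic, so distinct value triples
# mean distinct functions.
signatures = {
    tuple(quadratic(a, b, c)(x) for x in (-1, 0, 1)) for a, b, c in candidates
}


def best_binary_fit(points):
    """Brute-force search over all binarized quadratics by squared error."""
    def loss(params):
        f = quadratic(*params)
        return sum((f(x) - y) ** 2 for x, y in points)
    return min(candidates, key=loss)


# Noisy samples of x^2 - x + 1: the search still recovers (1, -1, 1).
data = [(-2, 7.2), (-1, 2.9), (0, 1.1), (1, 0.8), (2, 3.1)]
assert len(candidates) == 8 and len(signatures) == 8
assert best_binary_fit(data) == (1, -1, 1)
```

With only eight hypotheses, training reduces to choosing among a handful of functions, which is the implicit regularization effect the caption describes.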