SCHEME: Scalable Channel Mixer for Vision Transformers

Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos

TL;DR

The paper targets the ViT channel mixer bottleneck by introducing the SCHEME module, which replaces dense MLPs with a scalable block-diagonal MLP (BD-MLP) that allows a larger expansion ratio $E$, and pairs it with a training-time Channel Covariance Attention (CCA) branch that mixes features across channel groups and improves the feature clusters they form. Crucially, CCA is discarded at inference, so it improves final accuracy at no extra inference cost. Substituting SCHEME into a representative ViT (Metaformer-PPAA-S12) yields a family of SCHEMEformer models that establishes new Pareto frontiers for accuracy vs. FLOPs, model size, and throughput. Experiments with 12 backbones on classification, detection, and segmentation show gains over state-of-the-art ViTs of up to 1.5% in accuracy and up to 20% in latency, particularly in low-complexity regimes. Overall, SCHEME provides a flexible, efficient channel mixer that improves the computation-accuracy trade-off and broadens ViT deployment opportunities.

Abstract

Vision Transformers have achieved impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP), which accounts for a significant portion of the model parameters and computation. In this work, we show that the dense MLP connections can be replaced with a sparse block diagonal MLP structure that supports larger expansion ratios by splitting MLP features into groups. To improve the feature clusters formed by this structure, we propose the use of a lightweight, parameter-free, channel covariance attention (CCA) mechanism as a parallel branch during training. This enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. As a result, the CCA block can be discarded during inference, enabling enhanced performance at no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal MLP structure. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with $\textbf{12 different ViT backbones}$, consistently demonstrate substantial accuracy/latency gains (up to $\textbf{1.5\%/20\%}$) over existing designs, especially for lower complexity regimes. The SCHEMEformer family is shown to establish new Pareto frontiers for accuracy vs FLOPs, accuracy vs model size, and accuracy vs throughput, especially for fast transformers of small size.
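
To make the claim that a block diagonal structure "supports larger expansion ratios" concrete, the following is a minimal sketch of the weight count of a two-layer channel MLP with and without grouping. It assumes biases are ignored, that both layers use the same number of equal-sized groups, and that the channel width `d` is divisible by the group count; the function names are illustrative and do not come from the paper.

```python
def dense_mlp_params(d: int, expansion: int) -> int:
    """Weights of a dense two-layer channel MLP d -> expansion*d -> d (no biases)."""
    hidden = expansion * d
    return d * hidden + hidden * d


def block_diag_mlp_params(d: int, expansion: int, groups: int) -> int:
    """Weights when both layers are block diagonal with `groups` equal blocks.

    Each block maps d/groups input channels to hidden/groups hidden channels
    (and back), so the total is the dense count divided by `groups`.
    """
    if d % groups != 0:
        raise ValueError("d must be divisible by groups")
    hidden = expansion * d
    block_in, block_hidden = d // groups, hidden // groups
    per_layer = groups * block_in * block_hidden
    return 2 * per_layer


if __name__ == "__main__":
    d = 64  # hypothetical channel width, chosen only for illustration
    print(dense_mlp_params(d, expansion=4))                  # dense, E = 4
    print(block_diag_mlp_params(d, expansion=8, groups=2))   # E = 8, 2 groups
    print(block_diag_mlp_params(d, expansion=8, groups=4))   # E = 8, 4 groups
```

Under these assumptions, doubling the expansion ratio while splitting the features into two groups leaves the weight count unchanged, and using four groups halves it, which is the budget that the block diagonal structure frees up for larger $E$.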

Paper Structure (23 sections, 4 equations, 20 figures, 17 tables)


Figures (20)

  • Figure 1: Comparison of the proposed SCHEMEformer family, derived from the Metaformer-PPAA-S12 model [yu2022metaformer] with higher expansion ratios in the MLP blocks, and many SOTA transformers from the literature. The SCHEMEformer family establishes a new Pareto frontier (optimal trade-off) for a) accuracy vs. FLOPs, b) accuracy vs. model size, and c) accuracy vs. throughput. SCHEMEformer models are particularly effective for the design of fast transformers (throughput between 75 and 150 images/s) with small model size. See Appendix Figure \ref{acc_flops_supp} for a zoomed version.
  • Figure 1: MetaFormer-S12 [yu2022metaformer] ImageNet-1K validation accuracy vs. MLP expansion ratio ($E$). ${*}$: results obtained with the authors' code.
  • Figure 2: Proposed SCHEME channel mixer. The channel mixer of the standard transformer consists of two MLP layers, performing dimensionality expansion and reduction by a factor of $E$. SCHEME uses a combination of a block diagonal MLP (BD-MLP), which reduces the complexity of the MLP layers by using block diagonal weights, and a channel covariance attention (CCA) mechanism that enables communication across feature groups through feature-based attention. This, however, is only needed for training. The weights $1-\alpha$ decay to zero upon training convergence and CCA can be discarded during inference, as shown on the right. Experiments show that CCA helps learn better feature clusters, but is not needed once these are formed.
  • Figure 3: Impact of CCA (SCHEMEformer-44-e8-S12). Left: Evolution of the weight $1-\alpha$ across model layers. Right: Class separability of output features $\mathbf{y}$ (over 50 random classes of the ImageNet-1K validation set) for models trained with and without CCA. See Appendix Figure \ref{fig:tradeoff_supp} for a zoomed version.
  • Figure 4: Image Classification on ImageNet-1K. Comparison of SCHEME models using expansion ratio 8 with SOTA ViTs grouped by accuracy. The SCHEME family has higher throughput and accuracy than SOTA models.
  • ...and 15 more figures
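
The channel mixer described in the Figure 2 caption can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the convex combination $\alpha \cdot \text{BD-MLP}(x) + (1-\alpha) \cdot \text{CCA}(x)$ is inferred from the caption's statement that the weights $1-\alpha$ decay to zero, the CCA form (a softmax over the channel-by-channel Gram matrix with a hypothetical temperature `tau`) is an assumed parameter-free attention, and all function names are made up for this example.

```python
import numpy as np


def block_diag_linear(x, blocks):
    """Block-diagonal linear map: each weight block sees only its own channel group.

    x: (tokens, C_in); blocks: list of G arrays, each (C_in / G, C_out / G).
    """
    groups = np.split(x, len(blocks), axis=-1)
    return np.concatenate([g @ w for g, w in zip(groups, blocks)], axis=-1)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cca(x, tau=1.0):
    """Assumed parameter-free channel attention: weights from the channel Gram matrix."""
    attn = softmax(x.T @ x / tau, axis=-1)  # (C, C), mixes across all channel groups
    return x @ attn                          # (tokens, C)


def scheme_mixer(x, w1_blocks, w2_blocks, alpha, training):
    """BD-MLP plus a training-only CCA branch, blended by alpha (assumed form)."""
    y = block_diag_linear(gelu(block_diag_linear(x, w1_blocks)), w2_blocks)
    if training:
        y = alpha * y + (1.0 - alpha) * cca(x)
    return y  # at inference only the BD-MLP path is evaluated


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, channels, expansion, n_groups = 5, 8, 2, 4
    x = rng.standard_normal((tokens, channels))
    w1 = [rng.standard_normal((channels // n_groups, expansion * channels // n_groups))
          for _ in range(n_groups)]
    w2 = [rng.standard_normal((expansion * channels // n_groups, channels // n_groups))
          for _ in range(n_groups)]
    print(scheme_mixer(x, w1, w2, alpha=0.5, training=True).shape)
```

In this sketch, once `alpha` reaches 1 the training-time output coincides with the inference-time output, which is why the CCA branch can be dropped without changing the model's predictions.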