
Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors

Ke Meng, Kai Chen

TL;DR

The paper tackles the challenge of deploying vision transformers on mobile devices by using Gaussian processes to model how global architecture factors—resolution $r$, width $w$, and block depths $d_i$ and $d_m$—nonlinearly affect performance under a MACs constraint. It introduces a 4D design-space navigation and a GP-based downsizing rule that yields smaller MobileViT V2 variants with competitive or improved accuracy, validated across ImageNet-100, TieredImageNet, CIFAR-10/100, and other datasets, and demonstrates reduced latency on mobile hardware. A 2D grid GP is employed to analyze joint-factor effects (e.g., resolution-width, resolution-depth) and to identify high-performing regions in the factor space. The approach leverages a Matérn kernel and NSGA-III Pareto-front sampling to derive practical guidelines for architecture compression, with results indicating that optimizing resolution and width yields substantial gains, enabling efficient mobile ViTs with real-world applicability.
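As an illustration of the modeling idea in the TL;DR, the following is a minimal sketch of Gaussian process regression with a Matérn kernel (smoothness 5/2) over four normalized architecture factors. It is not the authors' implementation. The training points, the accuracies, the fixed hyperparameters (`length_scale`, `variance`, `noise`), and all function names are hypothetical; in practice the hyperparameters would be fitted to the data rather than fixed.

```python
import math

def matern52(x1, x2, length_scale=1.0, variance=1.0):
    """Matern kernel with smoothness nu = 5/2 between two points."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    s = math.sqrt(5.0) * dist / length_scale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def solve_lower(low, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i, row in enumerate(low):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def solve_upper_from_lower(low, y):
    """Back substitution: solve L^T x = y using the lower factor L."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        acc = y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = acc / low[i][i]
    return x

def gp_predict(train_x, train_y, query, length_scale=1.0, variance=1.0, noise=1e-4):
    """Posterior mean and standard deviation of the GP at one query point."""
    mean_y = sum(train_y) / len(train_y)
    centered = [y - mean_y for y in train_y]
    n = len(train_x)
    gram = [[matern52(train_x[i], train_x[j], length_scale, variance)
             + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    low = cholesky(gram)
    alpha = solve_upper_from_lower(low, solve_lower(low, centered))
    k_star = [matern52(x, query, length_scale, variance) for x in train_x]
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = max(matern52(query, query, length_scale, variance) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)

# Synthetic example: factors (r, w, d_i, d_m) scaled to [0, 1].
# These accuracies are made up for illustration and are not from the paper.
train_x = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.5, 0.5), (0.5, 0.5, 0.5, 0.5),
           (0.0, 1.0, 1.0, 0.0), (1.0, 1.0, 0.0, 1.0)]
train_y = [0.70, 0.78, 0.82, 0.74, 0.76]

mean, std = gp_predict(train_x, train_y, (0.6, 0.4, 0.5, 0.5))
print(f"predicted accuracy {mean:.3f}, 95% interval half-width {1.96 * std:.3f}")
```

The predictive standard deviation shrinks near observed configurations and grows away from them, which is what makes the model useful for deciding where in the factor space to sample next.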

Abstract

Numerous techniques have been meticulously designed to achieve optimal architectures for convolutional neural networks (CNNs), yet vision transformers (ViTs) have received far less comparable attention. Despite the remarkable success of ViTs in various vision tasks, their heavyweight nature leads to high computational costs. In this paper, we leverage the Gaussian process to systematically explore the nonlinear and uncertain relationship between performance and the global architecture factors of MobileViT: resolution, width, and depth (both the depth of inverted residual blocks and the depth of ViT blocks), as well as joint factors such as resolution-depth and resolution-width. We present design principles for twisting the "magic 4D cube" of global architecture factors to reduce model size and computational cost while achieving higher accuracy. We introduce a formula for downsizing architectures that iteratively derives smaller MobileViT V2 models while adhering to a specified constraint on multiply-accumulate operations (MACs). Experimental results show that models derived with our formula significantly outperform CNNs and mobile ViTs across diverse datasets.
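To make the idea of searching under a MACs constraint concrete, here is a rough sketch that enumerates multiplier combinations whose estimated cost stays close to a target budget. It does not reproduce the paper's downsizing formula, which is not given in this summary. The cost proxy assumes MACs grow quadratically in resolution and width and linearly in depth, a common approximation for convolutional networks; `mv2_share`, the multiplier grid, and both function names are hypothetical.

```python
from itertools import product

def relative_macs(r, w, d_i, d_m, mv2_share=0.5):
    """Rough MACs relative to a baseline whose multipliers all equal 1.0.

    Assumption: cost is quadratic in resolution r and width w and linear in
    depth. mv2_share (hypothetical) is the fraction of baseline MACs spent in
    inverted residual blocks; the remainder is spent in ViT blocks.
    """
    depth_term = mv2_share * d_i + (1.0 - mv2_share) * d_m
    return (r ** 2) * (w ** 2) * depth_term

def candidates_under_budget(target_ratio, tolerance=0.05):
    """Multiplier tuples whose estimated MACs fall within tolerance of the target."""
    grid = [0.5, 0.75, 1.0, 1.25]  # hypothetical multiplier values per factor
    kept = []
    for r, w, d_i, d_m in product(grid, repeat=4):
        ratio = relative_macs(r, w, d_i, d_m)
        if abs(ratio - target_ratio) <= tolerance * target_ratio:
            kept.append(((r, w, d_i, d_m), ratio))
    return kept

# Candidates that roughly halve the baseline MACs under the proxy above.
for factors, ratio in candidates_under_budget(0.5):
    print(factors, round(ratio, 3))
```

Each surviving candidate would still need to be trained and evaluated, or scored by a surrogate model, since the proxy says nothing about accuracy.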

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: A ViT model comprises an input layer, position and patch embeddings, a linear projection layer, and a self-attention module. The self-attention module consists of a normalization layer, a multi-head attention layer, and a fully connected layer. The black arrows represent the connections and data flow between these layers.
  • Figure 2: The architecture of MobileViT. Conv-$n\times n$ and MV2 denote a standard $n\times n$ convolution and a MobileNetV2 block, respectively. Modules that perform downsampling are marked with $\downarrow 2$. The input feature passes through a convolutional layer that increases the channel dimension to a larger bottleneck channel size.
  • Figure 3: Diagram of separable self-attention. The resultant vector is normalized using softmax to produce context scores $\mathbf{c}_s$. These context scores are used to weight key tokens and produce a context vector $\mathbf{c}_v$, which encodes contextual information.
  • Figure 4: Fitted top-1 accuracy vs. $(r, d_i, d_m, w)$ for MobileViT V2 models with $\sim$2000M MACs on ImageNet-100. The plot shows 60 observations of top-1 accuracy together with the 95% predictive confidence interval (CI) of the GP regression model. The CI widens the further the predictions are extrapolated. The top-1 accuracy of the baseline MobileViT V2 model is marked with a green triangle. Red crosses denote models with higher accuracy than the baseline, and black crosses denote models with lower accuracy.
  • Figure 5: Fitted top-5 accuracy vs. $(r, d_i, d_m, w)$ for MobileViT V2 models with $\sim$2000M MACs on ImageNet-100. The plot uses the same markers as the top-1 accuracy plot. Performance peaks at mid-range values of each factor and is lowest at small values. The overall variation is smooth but not exactly linear.
  • ...and 4 more figures
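
The context-score step described in the Figure 3 caption can be sketched as follows. This covers only the part the caption states: a softmax over per-token scores produces $\mathbf{c}_s$, which then weights the key tokens to form $\mathbf{c}_v$. The scoring weights `w_score`, the function names, and the toy inputs are hypothetical, and the remaining projections of the full separable self-attention layer are omitted.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(tokens, w_score, keys):
    """Return context scores c_s and context vector c_v.

    tokens:  k input tokens, each a list of d floats
    w_score: d weights (hypothetical) mapping each token to one scalar score
    keys:    k key tokens, each a list of d floats
    """
    raw = [sum(t * w for t, w in zip(token, w_score)) for token in tokens]
    c_s = softmax(raw)
    dim = len(keys[0])
    c_v = [sum(c_s[i] * keys[i][j] for i in range(len(keys))) for j in range(dim)]
    return c_s, c_v

# Toy input: three tokens of dimension two, reused here as the key tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_s, c_v = context_vector(tokens, w_score=[0.5, -0.5], keys=tokens)
print(c_s, c_v)
```

Because the scores reduce each token to a single scalar, the cost of this step grows linearly with the number of tokens instead of quadratically as in standard multi-head attention.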