Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors
Ke Meng, Kai Chen
TL;DR
The paper tackles the challenge of deploying vision transformers on mobile devices by using Gaussian processes to model how global architecture factors—resolution $r$, width $w$, and block depths $d_i$ and $d_m$—nonlinearly affect performance under a MACs constraint. It introduces a 4D design-space navigation and a GP-based downsizing rule that yields smaller MobileViT V2 variants with competitive or improved accuracy, validated across ImageNet-100, TieredImageNet, CIFAR-10/100, and other datasets, and demonstrates reduced latency on mobile hardware. A 2D grid GP is employed to analyze joint-factor effects (e.g., resolution-width, resolution-depth) and to identify high-performing regions in the factor space. The approach leverages a Matérn kernel and NSGA-III Pareto-front sampling to derive practical guidelines for architecture compression, with results indicating that optimizing resolution and width yields substantial gains, enabling efficient mobile ViTs with real-world applicability.
Abstract
Numerous techniques have been designed to find optimal architectures for convolutional neural networks (CNNs), yet vision transformers (ViTs) have received far less attention in this respect. Despite the remarkable success of ViTs in various vision tasks, their heavyweight nature incurs high computational costs. In this paper, we use a Gaussian process to systematically explore the nonlinear and uncertain relationship between performance and the global architecture factors of MobileViT: resolution, width, and depth (both the depth of inverted residual blocks and the depth of ViT blocks), as well as joint factors such as resolution-depth and resolution-width. We present design principles for twisting the "magic 4D cube" of global architecture factors so as to reduce model size and computational cost while achieving higher accuracy. We introduce a formula for downsizing architectures that iteratively derives smaller MobileViT V2 variants while adhering to a specified constraint on multiply-accumulate operations (MACs). Experimental results show that models derived with our formula significantly outperform CNNs and mobile ViTs across diverse datasets.
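To make the Gaussian-process modelling concrete, the following is a minimal sketch and not the authors' implementation. It fits a GP with a Matérn kernel (smoothness 5/2 is assumed here; the source does not state it) to a 2D grid of normalized (resolution, width) points and returns a posterior mean and variance for an unseen configuration. The function `toy_score`, the length scale, the noise level, and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
import math

def matern52(x, y, length_scale=0.5):
    """Matern kernel (nu = 5/2, unit variance) between two factor vectors."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) / length_scale
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(A):
    """Lower-triangular L with L @ L.T == A for a positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = [0.0] * len(b)
    for i in range(len(b)):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    return z

def solve_upper(L, z):
    """Back substitution: solve L.T x = z."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_star, noise=1e-4, length_scale=0.5):
    """Posterior mean and variance of a constant-mean GP at x_star."""
    mean_y = sum(y) / len(y)
    K = [[matern52(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, [v - mean_y for v in y]))
    k_star = [matern52(a, x_star, length_scale) for a in X]
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = matern52(x_star, x_star, length_scale) - sum(t * t for t in v)
    return mean, max(var, 0.0)

def toy_score(r, w):
    """Hypothetical stand-in for measured accuracy; NOT data from the paper."""
    return 0.6 + 0.2 * r + 0.15 * w - 0.1 * r * w

# 3x3 grid over normalized resolution r and width w (synthetic observations).
X = [(r, w) for r in (0.0, 0.5, 1.0) for w in (0.0, 0.5, 1.0)]
y = [toy_score(r, w) for r, w in X]

# Query an unobserved configuration between grid points.
mean, var = gp_posterior(X, y, (0.75, 0.25))
print(f"predicted score: {mean:.3f}, predictive variance: {var:.4f}")
```

The predictive variance is what makes a GP useful for this kind of search: it is close to zero at configurations that have already been trained and grows away from them, so regions of the factor space that are both promising and uncertain can be identified. In practice a library implementation with learned hyperparameters would replace the hand-rolled solver above.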
