Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification

Mikhael Djajapermana, Moritz Reiber, Daniel Mueller-Gritschneder, Ulf Schlichtmann

TL;DR

This work addresses the challenge of deploying accurate neural networks on tinyML devices by introducing a hybrid CNN-ViT NAS search space that includes a Pooling block for explicit downsampling and a Pool-ViT variant to reduce MHSA costs. Implemented in the HANNAH NAS framework and evaluated on CIFAR-10 under a tight model-size constraint of 100,000 parameters, the approach uses evolutionary search with proxy training, followed by retraining of the best candidate, and supports deployment via TVM on a RISC-V-like target. The results show that the discovered hybrid CNN-ViT architectures can achieve competitive accuracy while delivering lower latency and smaller memory footprints than ResNet-based tinyML models, particularly when Pool-ViT paths are used. This demonstrates the practical potential of combining CNN local-feature extraction with lightweight ViT components for efficient edge-device image classification. Future work includes quantization, pruning, distillation, and NPU co-processing.
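The search procedure summarized above (evolutionary search under a parameter budget) can be sketched roughly as follows. This is a minimal illustration, not the HANNAH implementation: the search space, the parameter formulas, and `proxy_score` are all hypothetical stand-ins, and a real run would replace `proxy_score` with short proxy training on CIFAR-10.

```python
import random

PARAM_BUDGET = 100_000  # model-size constraint quoted in the summary above

# Hypothetical, heavily simplified search space: each architecture is a list of
# (block_type, channels) pairs. The search space in the paper is richer.
BLOCK_TYPES = ("conv", "pool", "vit")
CHANNELS = (8, 16, 32, 64)


def count_params(arch, in_channels=3):
    """Rough parameter estimate for a candidate (illustrative formulas only)."""
    total, c_in = 0, in_channels
    for block, c_out in arch:
        if block == "conv":
            total += 3 * 3 * c_in * c_out + c_out  # 3x3 conv weights + bias
            c_in = c_out
        elif block == "vit":
            total += 4 * c_in * c_in  # Q, K, V and output projections, no bias
        # "pool" blocks add no parameters and keep the channel count
    return total + c_in * 10 + 10  # linear classifier for 10 classes


def random_arch(rng, depth=6):
    return [(rng.choice(BLOCK_TYPES), rng.choice(CHANNELS)) for _ in range(depth)]


def mutate(arch, rng):
    """Resample one randomly chosen block of the parent architecture."""
    child = list(arch)
    child[rng.randrange(len(child))] = (rng.choice(BLOCK_TYPES), rng.choice(CHANNELS))
    return child


def proxy_score(arch):
    """Placeholder fitness (assumption): stands in for proxy-training accuracy."""
    return len({block for block, _ in arch}) + count_params(arch) / PARAM_BUDGET


def evolutionary_search(generations=20, population_size=16, seed=0):
    rng = random.Random(seed)
    # Initial population: rejection-sample candidates that fit the budget.
    population = []
    while len(population) < population_size:
        candidate = random_arch(rng)
        if count_params(candidate) <= PARAM_BUDGET:
            population.append(candidate)
    for _ in range(generations):
        # Keep the better half, refill with feasible mutated children.
        ranked = sorted(population, key=proxy_score, reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(children) < population_size - len(parents):
            child = mutate(rng.choice(parents), rng)
            if count_params(child) <= PARAM_BUDGET:
                children.append(child)
        population = parents + children
    return max(population, key=proxy_score)


best = evolutionary_search()
print(best, count_params(best))
```

Infeasible candidates are simply rejected here, so every architecture that reaches the population respects the budget. Other constraint-handling strategies (for example, penalty terms) are equally plausible.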

Abstract

Hybrids of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have outperformed pure CNN or ViT architectures. However, since these architectures require many parameters and incur large computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as a novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that the proposed search space can produce hybrid CNN-ViT architectures with higher accuracy and faster inference than ResNet-based tinyML models under tight model size constraints.
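To see why reducing the feature map ahead of attention matters, recall that the token-mixing cost of multi-head self-attention grows quadratically with the number of tokens. The sketch below assumes that pooling with stride `pool_stride` is applied to the whole feature map before attention. The exact Pool-ViT design is not described in this summary, so the function name and the chosen sizes are illustrative only.

```python
def mhsa_token_mixing_macs(height, width, embed_dim, pool_stride=1):
    """MACs for QK^T plus the attention-weighted sum over V.

    The linear Q/K/V/output projections are ignored; summed over all heads,
    the two matrix products cost 2 * tokens^2 * embed_dim multiply-accumulates.
    """
    tokens = (height // pool_stride) * (width // pool_stride)
    return 2 * tokens * tokens * embed_dim


# A 32x32 feature map (the CIFAR-10 input resolution) with an assumed
# embedding dimension of 32.
full = mhsa_token_mixing_macs(32, 32, embed_dim=32)
pooled = mhsa_token_mixing_macs(32, 32, embed_dim=32, pool_stride=2)
print(full // pooled)  # stride-2 pooling: 4x fewer tokens, 16x fewer MACs
```

Halving each spatial dimension cuts the token count by 4 and the token-mixing cost by 16, which is the kind of saving a searchable pooling stage can trade against accuracy.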

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our proposed hybrid search space.
  • Figure 2: Types of CNN Blocks. The bottleneck structure is based on ResNet (He et al., 2016), while the inverted bottleneck is based on MobileNetV2 (Sandler et al., 2018).
  • Figure 3: Pooling Block with three types of pooling operators: (a) Max pooling, (b) a combination of Max and Average pooling, and (c) Average pooling.
  • Figure 4: Types of Hybrid ViT Blocks. The (a) ViT Block comprises an MHSA Block and an optional Feed-Forward (FF) Block.
  • Figure 5: Generated architecture candidates from four different search space designs on CIFAR10. The Pareto frontiers are connected with a line.
  • ...and 1 more figures