Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification

Mikhael Djajapermana; Moritz Reiber; Daniel Mueller-Gritschneder; Ulf Schlichtmann

TL;DR

This work addresses the challenge of deploying accurate neural networks on tinyML devices by introducing a hybrid CNN-ViT NAS search space. The search space includes a Pooling block for explicit downsampling and a Pool-ViT variant that reduces the cost of multi-head self-attention (MHSA). It is implemented in the HANNAH NAS framework and evaluated on CIFAR-10 under a tight model-size constraint of 100,000 parameters. The approach uses evolutionary search with proxy training, retrains the best candidate, and supports deployment via TVM on a RISC-V-like target. The results show that the discovered hybrid CNN-ViT architectures can achieve competitive accuracy with lower latency and smaller memory footprints than ResNet-based tinyML models, particularly when Pool-ViT paths are used. This demonstrates the practical potential of combining CNN local-feature extraction with lightweight ViT components for efficient image classification on edge devices. Future work includes quantization, pruning, distillation, and NPU co-processing.
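To illustrate why pooling can reduce MHSA cost, the following minimal sketch counts approximate multiply-accumulate operations (MACs) for one self-attention layer before and after shrinking the token grid. It uses the standard cost model for dense self-attention, not the paper's own measurements. The function names, the 16x16 feature map, the embedding width of 32, and the assumption that pooling is applied to all tokens entering attention are illustrative choices; the exact placement of pooling inside Pool-ViT is not specified in this summary.

```python
def mhsa_macs(num_tokens: int, dim: int) -> int:
    """Approximate MACs for one dense multi-head self-attention layer.

    Q, K, V and output projections cost 4 * N * d^2.
    The score matrix Q K^T and the weighted sum A V cost 2 * N^2 * d.
    Splitting d across heads does not change these totals.
    """
    projections = 4 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


def tokens_after_pool(height: int, width: int, pool: int) -> int:
    """Token count after non-overlapping pool x pool pooling (assumed placement)."""
    return (height // pool) * (width // pool)


# Hypothetical example: 16x16 feature map, embedding width 32.
full_tokens = tokens_after_pool(16, 16, 1)    # 256 tokens, no pooling
pooled_tokens = tokens_after_pool(16, 16, 2)  # 64 tokens after 2x2 pooling

full_cost = mhsa_macs(full_tokens, 32)
pooled_cost = mhsa_macs(pooled_tokens, 32)

print(full_cost, pooled_cost, full_cost / pooled_cost)
```

Under these assumptions, 2x2 pooling cuts the token count by 4x and the quadratic attention term by 16x, so the total MHSA cost in this example drops by about an order of magnitude.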

Abstract

Hybrids of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have outperformed pure CNN or ViT architectures. However, because these architectures require many parameters and incur high computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as a novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR-10 dataset show that the proposed search space can produce hybrid CNN-ViT architectures with higher accuracy and faster inference than ResNet-based tinyML models under tight model size constraints.
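As a rough illustration of what a tight model-size constraint means for a candidate architecture, the sketch below counts trainable parameters for a few common layer types and checks the total against a budget. The 100,000-parameter budget is the one stated in the TL;DR; the helper names and the example candidate (two convolutions, one attention layer, one classifier) are hypothetical and are not taken from the paper's search space or from the HANNAH API.

```python
PARAM_BUDGET = 100_000  # model-size constraint stated in the TL;DR


def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a standard (non-grouped) 2D convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameters of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def mhsa_params(dim: int) -> int:
    """Parameters of Q, K, V and output projections, each dim x dim with bias."""
    return 4 * linear_params(dim, dim)


def within_budget(layer_param_counts, budget: int = PARAM_BUDGET) -> bool:
    """True if the candidate's total parameter count fits the budget."""
    return sum(layer_param_counts) <= budget


# Hypothetical candidate, for illustration only.
candidate = [
    conv2d_params(3, 16, 3),   # stem convolution on RGB input
    conv2d_params(16, 32, 3),  # second convolution
    mhsa_params(32),           # one attention layer at width 32
    linear_params(32, 10),     # classifier for the 10 CIFAR-10 classes
]

print(sum(candidate), within_budget(candidate))
```

A search procedure could use a check like `within_budget` to discard oversized candidates before spending any training time on them; whether the paper's search applies the constraint this way is an assumption.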
