Towards Vision Mixture of Experts for Wildlife Monitoring on the Edge

Emmanuel Azuh Mensah, Anderson Lee, Haoran Zhang, Yitong Shan, Kurtis Heimerl

TL;DR

Inspired by recent work on data-driven conditional computation of subnetworks, we explore per-patch conditional computation for mobile vision transformers for the first time (vision-only case), with the eventual goal of using it in single-tower multimodal edge models.

Abstract

The explosion of IoT sensors in industrial, consumer, and remote sensing use cases has brought unprecedented demand for computing infrastructure to transmit and analyze petabytes of data. Concurrently, the world is slowly shifting its focus towards more sustainable computing. For these reasons, there has been a recent effort to reduce the footprint of the related computing infrastructure, especially that of the deep learning algorithms used for advanced insight generation. The "TinyML" community is actively proposing methods to save communication bandwidth and avoid excessive cloud storage costs while reducing inference latency and promoting data privacy. Such approaches should ideally process multiple types of data near the network edge, including time series, audio, satellite images, and video, because multiple data streams have been shown to improve the discriminative ability of learning algorithms, especially for fine-grained results. Meanwhile, recent work on data-driven conditional computation of subnetworks has shown real progress in using a single model to share parameters among very different types of inputs, such as images and text, reducing the computation requirements of multi-tower multimodal networks. Inspired by this line of work, we explore similar per-patch conditional computation for mobile vision transformers for the first time (vision-only case), with the eventual goal of using it in single-tower multimodal edge models. We evaluate the model on Cornell Sapsucker Woods 60 (SSW60), a fine-grained bird species discrimination dataset. Our initial experiments use $4\times$ fewer parameters than MobileViTV2-1.0, with a $1\%$ accuracy drop on the iNaturalist '21 birds test data provided as part of the SSW60 dataset.

Paper Structure

This paper contains 20 sections, 2 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Our proposed system, based on the MobileViT model [mehta2021mobilevit]. The expert-assignment router for any transformer mixture-of-experts layer is initialized by hierarchically clustering embeddings of sample training data, collected after the attention computation of a pretrained network.
  • Figure 2: Sample expert groupings in the last MobileViTV2 transformer layer for the iNaturalist 2021 bird species dataset included in SSW60. Note that these are relatively well-behaved groupings; Section \ref{sec:expert_class_affinity} discusses overall expert-assignment behaviour across classes.
  • Figure 3: Expert/agglomerative-cluster patch assignment probability distributions, as discussed in Section \ref{sec:expert_class_affinity}. (a) shows the class-expert affinity after fine-tuning the mixture-of-experts model with a randomly initialized router for $80$ epochs. (b) shows the post-fine-tuning class-expert affinity when the router is initialized with the cluster experts, which at initialization have the affinity shown in (c). Finally, (d) uses a softmax temperature of $0.001$ to compute affinities and sets scores lower than $0.05$ to zero.
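
The affinity computation described for panel d of Figure 3 can be illustrated with a minimal sketch. The captions give only the softmax temperature ($0.001$) and the zeroing threshold ($0.05$); everything else here is an assumption made for illustration, including the function names `tempered_softmax` and `class_expert_affinity`, the use of raw per-patch router logits as input, and averaging patch probabilities within each class to obtain a class-expert affinity. It is not the authors' implementation.

```python
import math


def tempered_softmax(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def class_expert_affinity(patch_logits, patch_classes, num_classes,
                          temperature=0.001, floor=0.05):
    """Average per-patch expert probabilities within each class.

    patch_logits: one list of router logits (one per expert) for each patch.
    patch_classes: class index of the image each patch came from.
    Assumption: affinity is the per-class mean of patch probabilities;
    entries below `floor` are set to zero, as in the caption.
    """
    num_experts = len(patch_logits[0])
    sums = [[0.0] * num_experts for _ in range(num_classes)]
    counts = [0] * num_classes
    for logits, cls in zip(patch_logits, patch_classes):
        probs = tempered_softmax(logits, temperature)
        for expert, p in enumerate(probs):
            sums[cls][expert] += p
        counts[cls] += 1
    affinity = []
    for cls in range(num_classes):
        n = max(counts[cls], 1)  # classes with no patches stay all-zero
        row = [s / n for s in sums[cls]]
        affinity.append([a if a >= floor else 0.0 for a in row])
    return affinity


if __name__ == "__main__":
    # Hypothetical router logits for three patches and two experts.
    logits = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
    classes = [0, 0, 1]
    for row in class_expert_affinity(logits, classes, num_classes=2):
        print(row)
```

With a temperature this low the softmax is close to a hard argmax, so each patch contributes almost all of its probability mass to a single expert, and the threshold then removes experts that a class rarely uses.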