Shifting Capsule Networks from the Cloud to the Deep Edge

Miguel Costa; Diogo Costa; Tiago Gomes; Sandro Pinto

TL;DR

This work tackles the challenge of deploying Capsule Networks (CapsNets) on edge devices by extending two open-source kernel libraries, CMSIS-NN and PULP-NN, to support 8-bit quantized CapsNets on Arm Cortex-M and RISC-V MCUs. It introduces a post-training quantization framework compatible with the $Q_{m.n}$ fixed-point format and evaluates latency, accuracy loss, and memory footprint on MNIST, smallNORB, and CIFAR-10, reporting a memory reduction of almost 75% with accuracy loss between 0.07% and 0.18%. The authors implement and optimize matrix multiplication, squash, and capsule-layer kernels. With medium-sized kernels, the primary capsule and capsule layers run in 119.94 ms and 90.60 ms on a Cortex-M7, and in 7.02 ms and 38.03 ms on the GAP-8, where RISC-V benefits particularly from SIMD and multi-core acceleration. This open-source, edge-focused pipeline demonstrates the practical viability of CapsNets at the deep edge, paving the way for lighter CapsNet architectures and future enhancements such as mixed-bit quantization and pruning.
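As an illustration of one of the kernels named above, the squash non-linearity can be sketched in floating point. This is a minimal reference, assuming the standard formulation from the original CapsNet literature, $v = \frac{\|s\|^2}{1+\|s\|^2}\,\frac{s}{\|s\|}$. The function name `squash` and the `eps` guard are illustrative choices, and the authors' 8-bit integer kernels are not reproduced here.

```python
import math


def squash(s, eps=1e-12):
    """Float reference of the squash non-linearity for one capsule vector.

    Assumption: standard form v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    `eps` is a hypothetical guard against division by zero for all-zero input.
    """
    sq_norm = sum(x * x for x in s)      # |s|^2
    norm = math.sqrt(sq_norm)            # |s|
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


if __name__ == "__main__":
    v = squash([3.0, 4.0])
    # The direction of the input is kept; only the length is compressed.
    print(v, math.sqrt(sum(x * x for x in v)))
```

The output length equals $\|s\|^2 / (1+\|s\|^2)$, which is always below 1, so squashed capsule outputs need no integer bits in a fixed-point representation.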

Abstract

Capsule networks (CapsNets) are an emerging trend in image processing. Unlike convolutional neural networks, CapsNets are not vulnerable to object deformation, as the relative spatial information of the objects is preserved across the network. However, their complexity, which stems mainly from the capsule structure and the dynamic routing mechanism, makes it almost unreasonable to deploy a CapsNet, in its original form, on a resource-constrained device powered by a small microcontroller (MCU). In an era where intelligence is rapidly shifting from the cloud to the edge, this high complexity poses serious challenges to the adoption of CapsNets at the very edge. To tackle this issue, we present an API for the execution of quantized CapsNets on Arm Cortex-M and RISC-V MCUs. Our software kernels extend Arm CMSIS-NN and RISC-V PULP-NN to support capsule operations with 8-bit integers as operands. Alongside the API, we propose a framework to perform post-training quantization of a CapsNet. Results show a reduction in memory footprint of almost 75%, with accuracy loss ranging from 0.07% to 0.18%. In terms of throughput, our Arm Cortex-M API enables the execution of primary capsule and capsule layers with medium-sized kernels in just 119.94 and 90.60 milliseconds (ms), respectively (STM32H755ZIT6U, Cortex-M7 @ 480 MHz). For the GAP-8 SoC (RISC-V RV32IMCXpulp @ 170 MHz), the latency drops to 7.02 and 38.03 ms, respectively.
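The 8-bit post-training quantization described above can be sketched as follows. This is a minimal illustration, not the authors' framework: it assumes a signed 8-bit power-of-two fixed-point format (Qm.n with m + n = 7), picks the number of fractional bits from the largest magnitude in a tensor, and saturates on overflow. The helper names `choose_frac_bits`, `quantize`, and `dequantize` are hypothetical.

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick n (fractional bits) so the largest magnitude fits a signed Qm.n word.

    Assumption: one power-of-two scale per tensor, m + n = total_bits - 1.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.ceil(math.log2(max_abs))  # m; negative for small ranges
    return (total_bits - 1) - int_bits


def quantize(values, frac_bits, total_bits=8):
    """Scale by 2^n, round to nearest, and saturate to the signed integer range."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    return [max(lo, min(hi, round(v * scale))) for v in values]


def dequantize(q_values, frac_bits):
    """Map integers back to real values for error inspection."""
    scale = 2.0 ** frac_bits
    return [q / scale for q in q_values]


if __name__ == "__main__":
    weights = [0.5, -1.25, 0.03125, 1.9]
    n = choose_frac_bits(weights)
    q = quantize(weights, n)
    print(n, q, dequantize(q, n))
```

If the unquantized model stores parameters as 32-bit floats (an assumption; the abstract does not state the baseline), moving each parameter from four bytes to one byte would by itself account for a reduction of up to 75%, which is consistent with the "almost 75%" figure reported.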
