StreamTinyNet: video streaming analysis with spatial-temporal TinyML
Hazem Hesham Yousef Shalby, Massimo Pavan, Manuel Roveri
TL;DR
StreamTinyNet is the first TinyML architecture to enable multi-frame video streaming analysis on resource-constrained devices by decoupling spatial feature extraction from temporal fusion. The approach uses a CNN-based spatial extractor $g(\bullet)$ to produce compact frame features, followed by a temporal fusion stage $h(\bullet)$ that aggregates across a sliding window of $T$ frames, facilitated by $1\times1$ temporal processing and a dense head. Experimental results on gesture recognition and GolfDB demonstrate that multi-frame analysis yields notable accuracy gains with modest increases in memory and compute, outperforming frame-by-frame TinyML baselines. The work further validates practicality by porting to the Arduino Nicla Vision with quantization, achieving real-time performance and low energy per inference, highlighting significant potential for on-device temporal video understanding in low-power environments.
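The split between the spatial extractor $g$ and the temporal fusion stage $h$ over a sliding window of $T$ frames can be illustrated with a minimal sketch. It is not the authors' implementation: the names `spatial_extractor`, `temporal_fusion`, and `StreamingAnalyzer` are hypothetical, the two functions are toy stand-ins (row means and a mean absolute difference) rather than a CNN and a $1\times1$ temporal layer with a dense head, and caching each frame's features so that $g$ runs once per incoming frame is an assumption of this sketch.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence

Frame = Sequence[Sequence[float]]  # toy grayscale image: rows of pixel values
Features = List[float]             # compact per-frame feature vector


def spatial_extractor(frame: Frame) -> Features:
    """Toy stand-in for g: one feature per row (the row mean), not a CNN."""
    return [sum(row) / len(row) for row in frame]


def temporal_fusion(window: Sequence[Features]) -> float:
    """Toy stand-in for h: mean absolute feature change between
    consecutive frames in the window (a crude motion score)."""
    diffs = [
        abs(a - b)
        for prev, curr in zip(window, window[1:])
        for a, b in zip(prev, curr)
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


class StreamingAnalyzer:
    """Keeps the features of the last T frames and fuses them on each step."""

    def __init__(
        self,
        g: Callable[[Frame], Features],
        h: Callable[[Sequence[Features]], float],
        T: int,
    ) -> None:
        if T < 1:
            raise ValueError("window length T must be at least 1")
        self.g = g
        self.h = h
        # Assumption of this sketch: features are cached, so each frame
        # passes through g only once; the oldest entry is dropped
        # automatically once the buffer holds T items.
        self.buffer: Deque[Features] = deque(maxlen=T)

    def step(self, frame: Frame) -> Optional[float]:
        """Process one incoming frame; return None until T frames were seen."""
        self.buffer.append(self.g(frame))
        if len(self.buffer) < self.buffer.maxlen:
            return None
        return self.h(list(self.buffer))


# Usage: feed frames one at a time, as they would arrive from a camera.
analyzer = StreamingAnalyzer(spatial_extractor, temporal_fusion, T=3)
still = [[0.0, 0.0], [0.0, 0.0]]
bright = [[1.0, 1.0], [1.0, 1.0]]
outputs = [analyzer.step(f) for f in (still, still, bright, bright)]
```

In this sketch the per-step cost is one call to `g` plus one call to `h` over $T$ small feature vectors, and the extra memory is the buffer of $T$ feature vectors rather than $T$ raw frames.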
Abstract
Tiny Machine Learning (TinyML) is a branch of Machine Learning (ML) that bridges the ML world and the embedded system ecosystem (i.e., Internet of Things devices, embedded devices, and edge computing units), enabling the execution of ML algorithms on devices constrained in terms of memory, computational capabilities, and power consumption. Video Streaming Analysis (VSA), one of the most interesting tasks of TinyML, consists of scanning a sequence of frames in a streaming manner, with the goal of identifying patterns of interest. Given the strict constraints of these tiny devices, all current solutions rely on frame-by-frame analysis and therefore do not exploit the temporal component of the data stream. In this paper, we present StreamTinyNet, the first TinyML architecture to perform multiple-frame VSA, enabling a variety of use cases that require spatial-temporal analysis and that were previously impossible to carry out at the TinyML level. Experimental results on publicly available datasets show the effectiveness and efficiency of the proposed solution. Finally, StreamTinyNet has been ported to and tested on the Arduino Nicla Vision, showing the feasibility of the proposed approach.
