StreamTinyNet: video streaming analysis with spatial-temporal TinyML

Hazem Hesham Yousef Shalby; Massimo Pavan; Manuel Roveri

TL;DR

StreamTinyNet is the first TinyML architecture to enable multi-frame video streaming analysis on resource-constrained devices by decoupling spatial feature extraction from temporal fusion. The approach uses a CNN-based spatial extractor $g(\cdot)$ to produce compact frame features, followed by a temporal fusion stage $h(\cdot)$ that aggregates across a sliding window of $T$ frames, facilitated by $1\times1$ temporal processing and a dense head. Experimental results on gesture recognition and GolfDB demonstrate that multi-frame analysis yields notable accuracy gains with modest increases in memory and compute, outperforming frame-by-frame TinyML baselines. The work further validates practicality by porting to the Arduino Nicla Vision with quantization, achieving real-time performance and low energy per inference, highlighting significant potential for on-device temporal video understanding in low-power environments.
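
The split between per-frame extraction and windowed fusion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the functions `g` and `h` below are toy stand-ins (row averaging and a weighted sum over time) for the CNN extractor and the $1\times1$ temporal stage, and the `StreamingPipeline` class, its `weights` parameter, and the idea of caching features in a fixed-length buffer are assumptions made only to show how a sliding window of $T$ frames might be processed in a streaming manner.

```python
from collections import deque


def g(frame):
    """Toy stand-in for the spatial extractor g(.) (hypothetical).

    Reduces a frame, given as a list of pixel rows, to one feature per row
    by averaging the row. The real extractor is a CNN.
    """
    return [sum(row) / len(row) for row in frame]


def h(features, weights):
    """Toy stand-in for the temporal fusion h(.) (hypothetical).

    Mixes the T cached feature vectors with one weight per time step,
    applied independently to each feature position.
    """
    n_features = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(n_features)
    ]


class StreamingPipeline:
    """Sliding-window pipeline: g runs once per frame, h runs on the window."""

    def __init__(self, T, weights):
        if len(weights) != T:
            raise ValueError("need exactly one temporal weight per frame")
        self.T = T
        self.weights = weights
        # Assumption: only the compact features are kept, not raw frames.
        self.buffer = deque(maxlen=T)
        self.g_calls = 0

    def step(self, frame):
        """Consume one frame; return a fused vector once T frames are seen."""
        self.buffer.append(g(frame))  # the oldest entry is dropped automatically
        self.g_calls += 1
        if len(self.buffer) < self.T:
            return None  # window not yet full
        return h(list(self.buffer), self.weights)
```

Under these assumptions, each call to `step` runs the extractor on the newest frame only, so the cost of covering $T$ frames is one extraction plus one fusion per incoming frame, and the memory held between frames is $T$ small feature vectors rather than $T$ images.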

Abstract

Tiny Machine Learning (TinyML) is a branch of Machine Learning (ML) that bridges the ML world and the embedded system ecosystem (i.e., Internet of Things devices, embedded devices, and edge computing units), enabling the execution of ML algorithms on devices constrained in terms of memory, computational capabilities, and power consumption. Video Streaming Analysis (VSA), one of the most interesting tasks of TinyML, consists of scanning a sequence of frames in a streaming manner, with the goal of identifying interesting patterns. Given the strict constraints of these tiny devices, all current solutions rely on frame-by-frame analysis and hence do not exploit the temporal component of the data stream. In this paper, we present StreamTinyNet, the first TinyML architecture to perform multiple-frame VSA, enabling a variety of use cases that require spatial-temporal analysis and that were previously impossible to carry out at the TinyML level. Experimental results on publicly available datasets show the effectiveness and efficiency of the proposed solution. Finally, StreamTinyNet has been ported to and tested on the Arduino Nicla Vision, showing the feasibility of the proposed approach.

Paper Structure (21 sections, 7 equations, 3 figures, 6 tables)

Introduction
Related works
TinyML
Approximate computing mechanisms
Network architecture redesigning
Video stream analysis in TinyML
Video classification
The proposed solution
Problem formulation
Architecture overview
StreamTinyNet description
Network complexity
Experimental results
Gesture recognition
Dataset generation
...and 6 more sections

Figures (3)

Figure 1: An overview of the proposed architecture.
Figure 2: Representation of the feature extractor $g(\cdot)$.
Figure 3: Representation of the proposed pipeline of $h(\cdot)$.
