CA3D: Convolutional-Attentional 3D Nets for Efficient Video Activity Recognition on the Edge

Gabriele Lagani, Fabrizio Falchi, Claudio Gennaro, Giuseppe Amato

TL;DR

The paper targets accurate video activity recognition on edge devices under privacy constraints. It proposes CA3D, a hybrid Convolutional-Attentional 3D network built from Convolutional-Attentional Spatio-Temporal (CAST) blocks, which combine spatial convolutions with linear-complexity temporal attention. A novel quantization mechanism maps pre-parameters to weights so that training and inference run entirely in 16-bit precision, reducing memory and compute. Experiments on UCF101, HMDB51, and Kinetics400 show competitive accuracy without external pretraining and at a compute cost suited to edge devices. The work advances practical, privacy-preserving video understanding for real-time edge applications by combining the inductive biases of CNNs with efficient attention and a memory-efficient quantization strategy.

Abstract

In this paper, we introduce a deep learning solution for video activity recognition that leverages an innovative combination of convolutional layers with a linear-complexity attention mechanism. Moreover, we introduce a novel quantization mechanism to further improve the efficiency of our model during both training and inference. Our model maintains a reduced computational cost, while preserving robust learning and generalization capabilities. Our approach addresses the issues related to the high computing requirements of current models, with the goal of achieving competitive accuracy on consumer and edge devices, enabling smart home and smart healthcare applications where efficiency and privacy issues are of concern. We experimentally validate our model on different established and publicly available video activity recognition benchmarks, improving accuracy over alternative models at a competitive computing cost.

Paper Structure

This paper contains 11 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Design of the Convolutional-Attentional 3D (CA3D) neural network for video processing, as a series of Convolutional-Attentional Spatio-Temporal (CAST) blocks. In CAST blocks, convolutional layers are alternated with attention layers, to take advantage of both types of processing. Convolutions are applied along the spatial dimensions, while attention aggregates global information from different frames along the temporal dimension. Processing of each layer is further enhanced with a deep column of residual stages.
  • Figure 2: Structure of the CA3D architecture in terms of CAST blocks. Blocks are formed by spatial and temporal processing parts. The spatial part is implemented in terms of convolutional blocks, followed by a column of residually connected layers. The temporal part is mediated by an attention operator, which is followed again by a column of residual layers. Pooling layers shrink the size of the tensor along the temporal dimension.
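
The temporal part of a CAST block, as described in the two captions above, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the kernel feature map (elu + 1), the projection matrices `w_q`, `w_k`, `w_v`, the average pooling, and all function names are assumptions chosen for illustration, and the spatial convolutions and residual columns are left out. The sketch only shows how attention along the temporal axis can be computed in time linear in the number of frames, followed by pooling that shrinks the temporal dimension.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (an assumption of this
    # sketch; the paper's attention kernel may differ).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_temporal_attention(x, w_q, w_k, w_v):
    """Attention across frames, independently at each spatial location.

    x: array of shape (T, H, W, C) holding per-frame feature maps.
    w_q, w_k: (C, D) projections; w_v: (C, C) projection (all hypothetical).
    """
    q = feature_map(x @ w_q)  # (T, H, W, D)
    k = feature_map(x @ w_k)  # (T, H, W, D)
    v = x @ w_v               # (T, H, W, C)
    # Summaries over time are computed once per location, so the cost
    # grows linearly with T instead of quadratically.
    kv = np.einsum("thwd,thwc->hwdc", k, v)  # (H, W, D, C)
    k_sum = k.sum(axis=0)                    # (H, W, D)
    num = np.einsum("thwd,hwdc->thwc", q, kv)
    den = np.einsum("thwd,hwd->thw", q, k_sum)[..., None]
    return num / den


def temporal_pool(x, stride=2):
    # Average pooling along the temporal axis only; trailing frames that
    # do not fill a full window are dropped.
    t = x.shape[0] - x.shape[0] % stride
    return x[:t].reshape(t // stride, stride, *x.shape[1:]).mean(axis=1)


# Usage: 8 frames of 4x4 feature maps with 6 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4, 6))
w_q = rng.standard_normal((6, 5))
w_k = rng.standard_normal((6, 5))
w_v = rng.standard_normal((6, 6))
y = temporal_pool(x + linear_temporal_attention(x, w_q, w_k, w_v))  # residual, then pool
print(y.shape)  # (4, 4, 4, 6)
```

Because the feature map is positive, the normalizer is always nonzero, and the result matches the quadratic form that builds an explicit T x T attention matrix at every spatial location while never materializing that matrix.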