YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition
Duc Manh Nguyen Dang, Viet Hang Duong, Jia Ching Wang, Nhan Bui Duc
TL;DR
YOWOv3 tackles the computational burden of spatio-temporal action detection (STAD) by presenting a lightweight, generalized two-stream framework that fuses spatial and temporal cues via a Fusion Head with CFAM attention and uses TAL or SimOTA for label assignment with tailored loss functions. It replaces heavy backbones with a YOLOv8-based spatial extractor and a 3D temporal backbone, achieving competitive mAP on UCF101-24 and AVAv2.2 while substantially reducing parameters and GFLOPs relative to YOWOv2. The work demonstrates strong empirical efficiency, provides multiple pretrained configurations for quick fine-tuning, and offers concrete ablations to justify design choices such as class balancing and fixed top-k, making it practical for real-time STAD deployment.
Abstract
In this paper, we propose YOWOv3, an improved version of YOWOv2 for the task of Human Action Detection and Recognition. The framework is built to facilitate extensive experimentation with different configurations and supports easy customization of the model's components, reducing the effort required to understand and modify the code. We compare YOWOv3 with YOWOv2 on two widely used datasets for Human Action Detection and Recognition: UCF101-24 and AVAv2.2. The predecessor model, YOWOv2, achieves an mAP of 85.2% and 20.3% on UCF101-24 and AVAv2.2, respectively, with 109.7M parameters and 53.6 GFLOPs. In contrast, our model, YOWOv3, achieves an mAP of 88.33% and 20.31% on UCF101-24 and AVAv2.2, respectively, with only 59.8M parameters and 39.8 GFLOPs. These results show that YOWOv3 significantly reduces the number of parameters and GFLOPs while matching or exceeding the performance of YOWOv2.
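To make the efficiency claim concrete, the figures quoted in the abstract can be turned into relative reductions and mAP differences. The short sketch below is only illustrative arithmetic on those reported numbers; the dictionary keys and the helper name `relative_reduction` are hypothetical and do not come from the YOWOv3 code base.

```python
def relative_reduction(old: float, new: float) -> float:
    """Return the percentage by which `new` is smaller than `old`."""
    return (old - new) / old * 100.0


# Figures as reported in the abstract (parameters in millions, mAP in percent).
yowov2 = {"params_m": 109.7, "gflops": 53.6, "map_ucf": 85.2, "map_ava": 20.3}
yowov3 = {"params_m": 59.8, "gflops": 39.8, "map_ucf": 88.33, "map_ava": 20.31}

# Relative savings in model size and compute.
param_cut = relative_reduction(yowov2["params_m"], yowov3["params_m"])
gflop_cut = relative_reduction(yowov2["gflops"], yowov3["gflops"])

# Absolute mAP differences, in percentage points.
ucf_gain = yowov3["map_ucf"] - yowov2["map_ucf"]
ava_gain = yowov3["map_ava"] - yowov2["map_ava"]

print(f"Parameter reduction: {param_cut:.1f}%")
print(f"GFLOPs reduction:    {gflop_cut:.1f}%")
print(f"UCF101-24 mAP gain:  {ucf_gain:+.2f} points")
print(f"AVAv2.2 mAP gain:    {ava_gain:+.2f} points")
```

Under these numbers, YOWOv3 uses roughly 45.5% fewer parameters and 25.7% fewer GFLOPs than YOWOv2, while gaining about 3.13 mAP points on UCF101-24 and staying essentially level (about +0.01 points) on AVAv2.2.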
