YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition

Duc Manh Nguyen Dang; Viet Hang Duong; Jia Ching Wang; Nhan Bui Duc

TL;DR

YOWOv3 tackles the computational burden of spatio-temporal action detection (STAD) with a lightweight, generalized two-stream framework. Spatial and temporal cues are fused in a Fusion Head that uses CFAM attention, and labels are assigned with either TAL or SimOTA, each paired with a tailored loss function. The framework replaces heavy backbones with a YOLOv8-based spatial extractor and a 3D temporal backbone, achieving competitive mAP on UCF101-24 and AVAv2.2 while substantially reducing parameters and GFLOPs relative to YOWOv2. The work provides multiple pretrained configurations for quick fine-tuning and reports ablations that justify design choices such as class balancing and fixed top-k, making it practical for real-time STAD deployment.
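To make the fusion step more concrete, here is a minimal sketch of Gram-matrix channel attention, which is how CFAM is described for the original YOWO. It assumes the spatial and temporal features have already been concatenated and flattened to C channels by N positions. The function name `channel_attention` and the scalar `gamma` are illustrative, not taken from the YOWOv3 code, and the convolution layers that surround the attention step in CFAM are omitted.

```python
import math

def channel_attention(features, gamma=1.0):
    """Gram-matrix channel attention on a C x N feature map (illustrative sketch).

    features: list of C rows, each a list of N floats (flattened H*W positions).
    gamma:    assumed scalar that weights the attended features before the
              residual addition; gamma=0 leaves the input unchanged.
    """
    c = len(features)
    # Gram matrix: similarity between every pair of channels.
    gram = [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(c)] for i in range(c)]
    out = []
    for i in range(c):
        # Row-wise softmax (shifted by the row maximum for numerical stability).
        row_max = max(gram[i])
        exps = [math.exp(g - row_max) for g in gram[i]]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output channel is a weighted mix of all channels ...
        attended = [sum(weights[j] * features[j][n] for j in range(c))
                    for n in range(len(features[i]))]
        # ... added back to the input as a residual.
        out.append([f + gamma * a for f, a in zip(features[i], attended)])
    return out

fused = [[1.0, 2.0], [0.5, -1.0]]  # toy C=2, N=2 feature map
refined = channel_attention(fused, gamma=1.0)
```

The output keeps the input shape, so a block like this can sit between the two feature extractors and the detection head without changing downstream dimensions.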

Abstract

In this paper, we propose a new framework called YOWOv3, an improved version of YOWOv2 designed specifically for Human Action Detection and Recognition. The framework is built to facilitate extensive experimentation with different configurations and supports easy customization of the model's components, reducing the effort required to understand and modify the code. YOWOv3 outperforms YOWOv2 on two widely used datasets for Human Action Detection and Recognition: UCF101-24 and AVAv2.2. Specifically, the predecessor model YOWOv2 achieves an mAP of 85.2% and 20.3% on UCF101-24 and AVAv2.2, respectively, with 109.7M parameters and 53.6 GFLOPs. In contrast, our model, YOWOv3, achieves an mAP of 88.33% and 20.31% on UCF101-24 and AVAv2.2, respectively, with only 59.8M parameters and 39.8 GFLOPs. The results demonstrate that YOWOv3 significantly reduces the number of parameters and GFLOPs while matching or exceeding the accuracy of YOWOv2.
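The size of the savings can be worked out directly from the figures quoted in the abstract. The short sketch below does that arithmetic; the dictionary keys and the helper `relative_reduction` are illustrative names, and the numbers are exactly those reported above.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline * 100.0

# Figures as reported in the abstract.
yowov2 = {"params_m": 109.7, "gflops": 53.6, "map_ucf": 85.2, "map_ava": 20.3}
yowov3 = {"params_m": 59.8, "gflops": 39.8, "map_ucf": 88.33, "map_ava": 20.31}

param_cut = relative_reduction(yowov2["params_m"], yowov3["params_m"])
gflops_cut = relative_reduction(yowov2["gflops"], yowov3["gflops"])
ucf_gain = yowov3["map_ucf"] - yowov2["map_ucf"]  # in mAP percentage points
ava_gain = yowov3["map_ava"] - yowov2["map_ava"]

print(f"parameters: -{param_cut:.1f}%  GFLOPs: -{gflops_cut:.1f}%")
print(f"mAP change: UCF101-24 {ucf_gain:+.2f}, AVAv2.2 {ava_gain:+.2f}")
```

In other words, YOWOv3 uses roughly 45% fewer parameters and roughly 26% fewer GFLOPs than YOWOv2, with a gain of about 3.1 mAP points on UCF101-24 and essentially unchanged mAP on AVAv2.2.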

Paper Structure (32 sections, 13 equations, 4 figures, 5 tables)

Introduction
YOWO AND YOWOV2
PROPOSED FRAMEWORK
OVERVIEW
Introduction
Spatial Feature Extractor
Decoupled Head
Temporal Motion Feature Extractor
Fusion Head
Detection Head
LABEL ASSIGNMENT
Introduction
TAL
SimOTA
LOSS FUNCTION
...and 17 more sections

Figures (4)

Figure 1: Trade-off between parameters and mAP on UCF101-24. YOWOv3 improves mAP while using computational resources more efficiently than previous models.
Figure 2: An overview architecture of YOWOv3
Figure 3: Overview of Channel Fusion and Attention Mechanism (CFAM) - an attention mechanism in YOWO
Figure 4: Visualization of YOWOv3 on UCF101-24 and AVAv2.2
