Vision-Based Activity Recognition in Children with Autism-Related Behaviors

Pengbo Wei; David Ahmedt-Aristizabal; Harshala Gammulle; Simon Denman; Mohammad Ali Armin

TL;DR

The paper tackles autism-related behavior recognition from videos captured in uncontrolled environments by proposing a region-based vision framework that crops the target child and applies temporal models to classify arm flapping, headbanging, and spinning. It systematically evaluates a range of feature extractors (both lightweight and conventional) and temporal models (LSTM, TCN, MS-TCN, MS-TCN++) using an extended SSBD-derived dataset with 61 videos and 168 clips, augmented with a customized Kinetics subset. The best overall performance is achieved with RGB I3D plus MS-TCN++ at a weighted F1-score of $0.83$, while a lightweight ESNet plus MS-TCN++ approach achieves $0.71$, suggesting feasible deployment on embedded systems. The work demonstrates the effectiveness of temporal convolutional architectures for ASD-behavior analysis in real-world videos and points toward real-time, region-focused diagnosis support for clinicians and caregivers. It also opens avenues for extending the approach to additional motor and mental health conditions through dataset expansion and hardware-efficient optimizations.

Abstract

Advances in machine learning and contactless sensors have enabled the understanding of complex human behaviors in a healthcare setting. In particular, several deep learning systems have been introduced to enable comprehensive analysis of neuro-developmental conditions such as Autism Spectrum Disorder (ASD). This condition affects children from their early developmental stages onwards, and diagnosis relies entirely on observing the child's behavior and detecting behavioral cues. However, the diagnosis process is time-consuming because it requires long-term behavior observation and because specialists are scarce. We demonstrate the effect of a region-based computer vision system to help clinicians and parents analyze a child's behavior. For this purpose, we adopt and enhance a dataset for analyzing autism-related actions using videos of children captured in uncontrolled environments (e.g. videos collected with consumer-grade cameras, in varied environments). The data is pre-processed by detecting the target child in the video to reduce the impact of background noise. Motivated by the effectiveness of temporal convolutional models, we propose both light-weight and conventional models capable of extracting action features from video frames and classifying autism-related behaviors by analyzing the relationships between frames in a video. Through extensive evaluations of the feature extraction and learning strategies, we demonstrate that the best performance is achieved with an Inflated 3D Convnet and Multi-Stage Temporal Convolutional Networks, achieving a 0.83 Weighted F1-score for classification of the three autism-related actions and outperforming existing methods. We also propose a light-weight solution by employing the ESNet backbone within the same system, achieving a competitive 0.71 Weighted F1-score and enabling potential deployment on embedded systems.
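
The headline numbers above are Weighted F1-scores. As a minimal sketch of how such a score is commonly computed (assuming the standard support-weighted definition; the paper's own evaluation code is not shown here), the per-class F1 values are averaged with weights proportional to how many true samples each class has. The function name `weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper or its dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (support[label] / total) * f1  # weight by class support
    return score


# Toy labels for the three behavior classes (made up for illustration).
y_true = ["arm_flapping", "arm_flapping", "headbanging", "spinning"]
y_pred = ["arm_flapping", "headbanging", "headbanging", "spinning"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means that classes with more clips contribute more to the final score, which matters when the three behaviors are not equally represented.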

Paper Structure (22 sections, 1 equation, 5 figures, 8 tables)

Introduction
Methods
Feature extraction backbones
EfficientNets
MobileNets
ShuffleNet
ESNet
ResNet
Inflated 3D Convnet (I3D)
Action recognition models
Long short-term memory (LSTM)
Temporal Convolutional Networks (TCN)
Multi-Stage Temporal Convolutional Network (MS-TCN)
Extended Multi-Stage Temporal Convolutional Network (MS-TCN++)
Experimental setup
...and 7 more sections

Figures (5)

Figure 1: Overview of the activity recognition pipeline. Given a sequence of RGB video frames, the system generates a sequence of feature vectors via the feature extractor. Next, the action recognition model recognizes the activity by using the generated sequence of visual features.
Figure 2: Multi-stage temporal convolutional network. Each stage produces an initial prediction, which is refined by the subsequent stage. Recreated from [farha2019ms].
Figure 3: A comparison between MS-TCN and MS-TCN++, showing the dilated residual layer in MS-TCN (left) and the dual dilated residual layer in MS-TCN++ (right). Adapted from [gammulle2022continuous].
Figure 4: Selected frames from the videos in the dataset used. From top to bottom, the autism-related behaviors 'arm flapping', 'headbanging', and 'spinning' are shown. These videos were recorded in an uncontrolled environment, and some videos contain activities that involve human interactions (e.g. the sample in the first row).
Figure 5: The pre-processing procedure. The child of interest is detected using the Detectron2 [wu2019detectron2] human detector, after which the target child is cropped from the video.
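
The difference between the two layer types in Figure 3 can be illustrated with their dilation schedules. The sketch below assumes the schedule commonly used in MS-TCN implementations (kernel size 3, dilation doubling at each layer) and, for MS-TCN++, a second dilated convolution whose dilation shrinks as the first one grows. The zero-based layer indexing, the function names, and the layer count are assumptions for illustration; this page does not state the configuration used in the paper.

```python
def mstcn_dilations(num_layers):
    """Dilation per layer in a single-dilation stage: 1, 2, 4, ..."""
    return [2 ** i for i in range(num_layers)]


def mstcnpp_dual_dilations(num_layers):
    """(increasing, decreasing) dilation pair per dual dilated layer."""
    return [(2 ** i, 2 ** (num_layers - 1 - i)) for i in range(num_layers)]


def receptive_field(dilations, kernel_size=3):
    """Frames seen by one output frame after stacking the dilated convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d  # each layer widens the window by (k-1)*d
    return field


layers = 4  # hypothetical depth, chosen only to keep the output short
print(mstcn_dilations(layers))
print(mstcnpp_dual_dilations(layers))
print(receptive_field(mstcn_dilations(layers)))
```

With a single increasing dilation, early layers only see a few neighboring frames; pairing it with a decreasing dilation gives every layer access to both local and long-range temporal context.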
