Video-Based Autism Detection with Deep Learning
M. Serna-Aguilera, X. B. Nguyen, A. Singh, L. Rockers, S. Park, L. Neely, H. Seo, K. Luu
TL;DR
This study presents a video-based approach to autism detection that avoids the need for MRI. Two CNN backbones capture body movement and facial expressions, and a temporal transformer connects their outputs to exploit context across frames. Trained and evaluated on cross-institution video data recorded under controlled stimuli, the method achieves ~81% accuracy and robust F1 scores despite limited data and limited temporal information. The work points toward accessible bedside screening and outlines future data collection and modeling to handle a broader range of head poses and occlusions.
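The architecture summarized above (two CNN backbones feeding a temporal transformer) can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation. The names `SmallCNN` and `DualStreamVideoClassifier`, the separate body and face crop inputs, the concatenation fusion, the learned positional embedding, the mean pooling, and all layer sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class SmallCNN(nn.Module):
    """Tiny stand-in for a CNN backbone (hypothetical, not the paper's backbone)."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (N, 32, 1, 1) for any input resolution
            nn.Flatten(),             # (N, 32)
            nn.Linear(32, out_dim),   # (N, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DualStreamVideoClassifier(nn.Module):
    """Two per-frame CNN streams fused and passed through a temporal transformer."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4,
                 num_layers: int = 2, num_classes: int = 2, max_frames: int = 64):
        super().__init__()
        self.body_cnn = SmallCNN(feat_dim)  # assumed stream for body movement
        self.face_cnn = SmallCNN(feat_dim)  # assumed stream for facial expression
        d_model = 2 * feat_dim              # fusion by concatenation (assumption)
        # Learned positional embedding so the transformer can see frame order.
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=256, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, body_frames: torch.Tensor, face_frames: torch.Tensor) -> torch.Tensor:
        # Both inputs have shape (batch, frames, 3, height, width).
        b, t = body_frames.shape[:2]
        body = self.body_cnn(body_frames.flatten(0, 1))  # (b*t, feat_dim)
        face = self.face_cnn(face_frames.flatten(0, 1))  # (b*t, feat_dim)
        tokens = torch.cat([body, face], dim=-1).reshape(b, t, -1)
        tokens = self.temporal(tokens + self.pos[:, :t])  # attend across frames
        return self.head(tokens.mean(dim=1))              # clip-level logits


# Usage with random tensors standing in for 8-frame clips.
model = DualStreamVideoClassifier().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 8, 3, 64, 64), torch.randn(2, 8, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 2])
```

Running each CNN on individual frames and leaving all cross-frame reasoning to the transformer keeps the two roles separate, which makes it easy to swap in a stronger backbone for either stream.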
Abstract
Individuals with Autism Spectrum Disorder (ASD) often experience challenges in health, communication, and sensory processing; therefore, early diagnosis is necessary for proper treatment and care. In this work, we consider the problem of classifying children as ASD or non-ASD to aid medical professionals in early diagnosis. We develop a deep learning model that analyzes video clips of children reacting to sensory stimuli, with the intent of capturing key differences in reactions and behavior between ASD and non-ASD participants. Unlike many recent ASD classification studies that rely on MRI data and therefore require expensive specialized equipment, our method needs only a powerful but relatively affordable GPU, a standard computer setup, and a video camera for inference. Results show that our model generalizes effectively and captures key differences in the children's movements. Notably, the model classifies successfully despite a dataset that is small for a deep learning problem, limited temporal information available for learning, and motion artifacts in the videos.
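For readers unfamiliar with the metrics used to report classification performance here (accuracy and F1), the following sketch shows how they are computed for a binary ASD versus non-ASD decision. The function name `binary_metrics` and the toy labels are hypothetical and are not data from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = ASD, 0 = non-ASD).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

Reporting F1 alongside accuracy matters when the two classes are imbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.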
