Language-Assisted Deep Learning for Autistic Behaviors Recognition
Andong Deng, Taojiannan Yang, Chen Chen, Qian Chen, Leslie Neely, Sakiko Oyama
TL;DR
The paper tackles ASD-related problem-behavior recognition from video, a task so far limited by reliance on vision-only models. It first establishes Video Swin Transformer as a strong baseline on the ESBD and SSBD ASD-behavior datasets, then introduces language-assisted training: detailed textual descriptions of each behavior are encoded by a CLIP text encoder to provide cross-modal supervision. The approach uses a joint objective $L = L_{CE} + \lambda L_{contrastive}$ with $L_{contrastive} = -\frac{v \cdot l}{\|v\| \, \|l\|}$, where $v$ and $l$ are the visual and language representations being aligned. Inference uses only the visual branch, so there is no extra cost at test time. Empirical results show that language supervision yields meaningful improvements over the video-only baseline, with detailed descriptions outperforming simple class names and pretrained backbones providing additional gains, highlighting a practical path toward more objective ASD behavior analysis and early intervention support.
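To make the joint objective concrete, the following is a minimal sketch for a single sample, written in plain Python rather than a deep learning framework. The function names, the toy feature vectors, and the default value of `lam` (standing in for λ) are illustrative assumptions, not the paper's implementation; in the described setup the visual feature would come from the video backbone and the language feature from the CLIP text encoder.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def cosine_similarity(v, l):
    """Cosine similarity (v . l) / (||v|| ||l||) between two feature vectors."""
    dot = sum(a * b for a, b in zip(v, l))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_l = math.sqrt(sum(b * b for b in l))
    return dot / (norm_v * norm_l)


def joint_loss(logits, target, video_feat, text_feat, lam=1.0):
    """L = L_CE + lam * L_contrastive, with L_contrastive = -cos(v, l).

    `lam` is a hypothetical weighting value chosen for illustration.
    """
    l_ce = cross_entropy(logits, target)
    l_contrastive = -cosine_similarity(video_feat, text_feat)
    return l_ce + lam * l_contrastive


# Toy example: uniform logits over two classes, perfectly aligned features.
loss = joint_loss([0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0], lam=0.5)
```

Because the contrastive term is the negative cosine similarity, it is smallest (-1) when the visual and language features point in the same direction, so minimizing the joint loss pulls the two representations together while the cross-entropy term still drives classification.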
Abstract
Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is vitally important for diagnosing autism and for timely early intervention. However, parents' observations and records of their children's behavior during treatment may not be accurate or objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) can alleviate this issue to a large extent. Existing human action recognition models now achieve strong performance on challenging activity datasets, e.g., daily activities and sports activities. However, problem behaviors in children with ASD differ substantially from these general activities, and recognizing them via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e., Video Swin Transformer, on two autism behavior datasets (SSBD and ESBD) and show that it achieves high accuracy and outperforms previous methods by a large margin, demonstrating the feasibility of vision-based problem behavior recognition. Moreover, we propose language-assisted training to further enhance action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework that incorporates the "freely available" language description of each type of problem behavior. Experimental results demonstrate that the additional language supervision brings a clear performance boost on the autism problem behavior recognition task compared to using video information only (i.e., a 3.49% improvement on ESBD and 1.46% on SSBD).
