Language-Assisted Deep Learning for Autistic Behaviors Recognition

Andong Deng, Taojiannan Yang, Chen Chen, Qian Chen, Leslie Neely, Sakiko Oyama

TL;DR

The paper tackles ASD-related problem-behavior recognition from video, a task so far limited by reliance on vision-only models. It first establishes Video Swin Transformer as a strong baseline on the ESBD and SSBD ASD-behavior datasets, then introduces language-assisted training: detailed textual descriptions of each behavior are encoded by a CLIP text encoder to provide cross-modal supervision. The approach uses a joint objective $L = L_{CE} + \lambda L_{contrastive}$ with $L_{contrastive} = - \frac{v \cdot l}{|v|\cdot|l|}$, where $v$ and $l$ denote the visual and language features, to align the two representations. Inference uses only the visual branch, so there is no extra cost at test time. Empirical results show that language supervision improves on the video-only baseline, that detailed descriptions outperform simple class names, and that pretrained backbones provide additional gains, highlighting a practical path toward more objective ASD behavior analysis and early intervention support.
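
The joint objective above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' implementation: the function names, the use of plain lists in place of tensors, and the value of the weight `lam` (standing in for λ, which this summary does not report) are all assumptions.

```python
import math


def cosine_similarity(v, l):
    """Cosine similarity (v . l) / (|v| * |l|) between two feature vectors."""
    dot = sum(a * b for a, b in zip(v, l))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_l = math.sqrt(sum(b * b for b in l))
    return dot / (norm_v * norm_l)


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(logits, target, v, l, lam):
    """L = L_CE + lam * L_contrastive, with L_contrastive = -cos(v, l).

    `lam` is a hypothetical weight; the summary does not state its value.
    """
    l_ce = cross_entropy(logits, target)
    l_contrastive = -cosine_similarity(v, l)
    return l_ce + lam * l_contrastive


# Toy usage: two classes, 2-D visual feature v and language feature l.
loss = joint_loss([0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0], lam=0.5)
```

Because the contrastive term is the negative cosine similarity, it is minimized (at -1) when the visual and language features point in the same direction. At test time only the classification path is evaluated, so the language feature `l` is not needed.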

Abstract

Correctly recognizing the behaviors of children with Autism Spectrum Disorder (ASD) is of vital importance for the diagnosis of autism and for timely early intervention. However, the observations and records made by parents of autistic children during treatment may not be accurate or objective. In such cases, automatic recognition systems based on computer vision and machine learning (in particular deep learning) can alleviate this issue to a large extent. Existing human action recognition models now achieve strong performance on challenging activity datasets, e.g., daily and sports activities. However, problem behaviors in children with ASD are very different from these general activities, and recognizing them via computer vision is less studied. In this paper, we first evaluate a strong baseline for action recognition, i.e., Video Swin Transformer, on two autism behavior datasets (SSBD and ESBD) and show that it achieves high accuracy and outperforms previous methods by a large margin, demonstrating the feasibility of vision-based problem-behavior recognition. Moreover, we propose language-assisted training to further enhance action recognition performance. Specifically, we develop a two-branch multimodal deep learning framework that incorporates the "freely available" language description of each type of problem behavior. Experimental results demonstrate that the additional language supervision brings a clear performance boost on the autism problem-behavior recognition task compared to using video information only (3.49% improvement on ESBD and 1.46% on SSBD).

Paper Structure (13 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Example video frames from the ESBD and SSBD datasets for ASD behavior recognition in children.
  • Figure 2: Illustration of the data preprocessing steps. We first detect the target child in the video frames with the YOLOv5 object detector and then crop the child region using the detected bounding boxes. The cropped frames, which have arbitrary sizes, are resized to a common size of $224\times 224$ before being fed into the action recognition network.
  • Figure 3: Illustration of Video Swin Transformer with basic configuration.
  • Figure 4: Comparison of the visual-only Video Swin Transformer (VST) and the proposed language-assisted framework (VST+L). For VST+L, we propose two alternatives: VST+L(w), which uses the action class name as the language input, and VST+L(d), which leverages a more detailed description of each action class, leading to more informative language features. The two alternatives are compared in Table \ref{['text input']}. The input video is processed by the Video Swin Transformer network for visual feature extraction, and the text description of the action in the video is processed by the CLIP text encoder for textual feature extraction. We use cross-entropy $L_{CE}$ as the classification loss and add the contrastive loss ($L_{contrastive}$) to enforce that the paired visual and textual features are close to each other. Note that language supervision is used only in the training stage to enhance the visual feature representation. In the test/inference stage, only the visual branch is used to extract features from a test video for action recognition.
  • Figure 5: Confusion matrices on (a) ESBD and (b) SSBD. "+L" denotes language supervision, and "pretrained" means the visual branch is initialized from a Kinetics-400 pre-trained model and then fine-tuned on ESBD or SSBD. Both language supervision and pretraining generally improve performance.
  • ...and 2 more figures
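
The crop-and-resize preprocessing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the detector itself is omitted, the bounding-box format `(x1, y1, x2, y2)` with exclusive lower-right corner is hypothetical, frames are plain nested lists rather than image arrays, and nearest-neighbor interpolation is assumed because the caption does not name a resizing method.

```python
def crop_and_resize(frame, box, size=224):
    """Crop `frame` to `box` and resize the crop to size x size.

    frame: H x W nested list of pixel values (one frame of a video).
    box:   (x1, y1, x2, y2) in pixel coordinates, x2/y2 exclusive;
           assumed to come from an object detector such as YOLOv5.
    Resizing uses nearest-neighbor sampling (an assumption).
    """
    x1, y1, x2, y2 = box
    crop = [row[x1:x2] for row in frame[y1:y2]]
    if not crop or not crop[0]:
        raise ValueError("bounding box does not overlap the frame")
    h, w = len(crop), len(crop[0])
    # Map each output pixel (i, j) back to its nearest source pixel.
    return [
        [crop[(i * h) // size][(j * w) // size] for j in range(size)]
        for i in range(size)
    ]


# Toy usage: a 4 x 6 frame whose pixel value encodes (row, column).
frame = [[10 * r + c for c in range(6)] for r in range(4)]
small = crop_and_resize(frame, (1, 1, 3, 3), size=4)
full = crop_and_resize(frame, (1, 1, 3, 3))  # default 224 x 224 output
```

Every frame comes out at the same spatial size regardless of how large the detected child region was, which is what lets crops of arbitrary size be batched into the recognition network.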