Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition

Shijian Deng; Erin E. Kosloski; Siddhi Patel; Zeke A. Barnett; Yiyang Nan; Alexander Kaplan; Sisira Aarukapalli; William T. Doan; Matthew Wang; Harsh Singh; Pamela R. Rollins; Yapeng Tian

TL;DR

This work defines audio-visual autism behavior recognition and introduces AV-ASD, a large, multi-label dataset spanning social-communication and repetitive behaviors. It demonstrates that integrating audio, visual, and speech cues via foundation models and multimodal LLMs improves recognition over vision-only approaches, and introduces LLaVA-ASD, an instruction-tuned model that leverages audio captions and speech transcriptions for superior performance. A novel post-hoc to ad-hoc explainability framework is proposed to maintain predictive accuracy while preserving model explanations, addressing a key challenge in using MLLMs for clinical tasks. The AV-ASD dataset, baselines, and the explainability framework collectively advance objective, scalable autism screening and offer a rich resource for future multimodal ASD research.

Abstract

In this article, we introduce the novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. The task uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances performance in autism behavior recognition. Additionally, we explored a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model's explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.

Paper Structure (17 sections, 3 equations, 8 figures, 4 tables)

Introduction
Related Work
AI and Datasets in ASD Research
Audio-Visual Learning
The AV-ASD Dataset
Multimodal Nature of Social Behaviors
Data Collection, Annotation, and Statistics
Annotators and Instructions for Annotation
Autism Behavior Recognition
Zero-shot Baselines with MLLMs
MLLMs with Instruction Tuning
Experiments
Experimental Setup
Results and Analysis
Beyond Recognition: Explainability
...and 2 more sections

Figures (8)

Figure 1: The vision-only model incorrectly identified two behaviors that were not present, whereas the audio-visual model correctly identified the three behaviors present in the clip. This illustrates how multimodal integration enables more accurate behavior identification.
Figure 2: A depiction of the AV-ASD dataset, illustrating five sample instances from each category.
Figure 3: A child with ASD responds "I have hands." when asked "How old are you?"
Figure 4: Statistical illustrations of the AV-ASD dataset.
Figure 5: LLaVA-ASD: Instruction Tuning for LLaVA. The inputs are a video preview $I_V$ and an enhanced text prompt $P'$, which is a text prompt $P$ augmented with an audio caption $AC$ and a speech transcription $ST$. These elements are combined to form the model's instruction input $Inst$. The output $y$ consists of multiple autism behavior labels presented in text format. We employed LoRA for efficient fine-tuning.
...and 3 more figures
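
The prompt construction described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the "Audio caption:" and "Speech transcription:" field prefixes, the comma-separated output format, and the placeholder label names are all hypothetical. The caption only states that $P$ is augmented with $AC$ and $ST$ to form $P'$, and that the labels $y$ are emitted as text.

```python
from typing import List, Sequence


def build_enhanced_prompt(prompt: str, audio_caption: str, speech_transcription: str) -> str:
    """Form P' by appending AC and ST to the base prompt P.

    The field prefixes below are an assumed layout, not taken from the paper.
    Empty captions or transcriptions are skipped.
    """
    parts = [prompt.strip()]
    if audio_caption.strip():
        parts.append(f"Audio caption: {audio_caption.strip()}")
    if speech_transcription.strip():
        parts.append(f"Speech transcription: {speech_transcription.strip()}")
    return "\n".join(parts)


def parse_labels(output_text: str, vocabulary: Sequence[str]) -> List[int]:
    """Map a comma-separated text output y to a multi-hot vector.

    Matching is case-insensitive; items not in the vocabulary are ignored.
    """
    predicted = {item.strip().lower() for item in output_text.split(",")}
    return [1 if label.lower() in predicted else 0 for label in vocabulary]


if __name__ == "__main__":
    # The speech transcription is the exchange quoted in the Figure 3 caption;
    # the base prompt and audio caption are made up for illustration.
    enhanced = build_enhanced_prompt(
        "Which autism-related behaviors are present in this clip?",
        "an adult and a child talking",
        "How old are you? I have hands.",
    )
    print(enhanced)

    # Placeholder label names; the dataset's actual label set is not listed here.
    vocabulary = ["behavior_a", "behavior_b", "behavior_c"]
    print(parse_labels("behavior_b, Behavior_A", vocabulary))
```

Keeping the audio caption and speech transcription as plain text fields means the language model receives all three modalities through its existing image and text inputs, which is the design the caption describes. The multi-hot parsing step is one assumed way to score text outputs against multi-label annotations.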
