Detecting Children with Autism Spectrum Disorder based on Script-Centric Behavior Understanding with Emotional Enhancement

Wenxing Liu, Yueran Pan, Dong Zhang, Hongzhu Deng, Xiaobing Zou, Ming Li

TL;DR

The paper tackles early ASD detection from limited audio-visual data by converting videos into textual behavior scripts and applying large language models in zero-shot and few-shot settings. It introduces a three-module SCBU pipeline (Behavior Transcription, Script Transcription, and emotion-aware domain prompting) with multi-LLM collaboration to achieve high diagnostic performance and interpretable rationales. Key contributions include the script transcription framework, the textualization of emotional dynamics, and a domain-prompting strategy that injects clinical ASD knowledge. Experimental results on a dataset of toddler-age children show state-of-the-art zero-shot and few-shot performance (F1 up to 95.24%) together with explainable detection rationales, suggesting practical potential for clinical screening and decision support.

Abstract

The early diagnosis of autism spectrum disorder (ASD) is critically dependent on systematic observation and analysis of children's social behaviors. While current methodologies predominantly utilize supervised learning approaches, their clinical adoption faces two principal limitations: insufficient ASD diagnostic samples and inadequate interpretability of the detection outcomes. This paper presents a novel zero-shot ASD detection framework based on script-centric behavioral understanding with emotional enhancement, which is designed to overcome these clinical constraints. The proposed pipeline automatically converts audio-visual data into structured behavioral text scripts through computer vision techniques, then capitalizes on the generalization capabilities of large language models (LLMs) for zero-shot/few-shot ASD detection. Three core technical contributions are introduced: (1) a multimodal script transcription module that transforms behavioral cues into structured textual representations; (2) an emotion textualization module that encodes emotional dynamics as contextual features to augment behavioral understanding; (3) a domain-specific prompt engineering strategy that injects clinical knowledge into LLMs. Our method achieves an F1-score of 95.24% in diagnosing ASD in children with an average age of two years while generating interpretable detection rationales. This work opens up new avenues for leveraging the power of LLMs in analyzing and understanding ASD-related human behavior, thereby enhancing the accuracy of assisted autism diagnosis.
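
The script-transcription and domain-prompting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `BehaviorEvent` fields, the script line format, the prompt layout, and the function names `events_to_script` and `build_prompt` are all assumptions made for illustration, and the clinical background text is a placeholder to be supplied by a domain expert.

```python
from dataclasses import dataclass


@dataclass
class BehaviorEvent:
    """One entry of a hypothetical behavior log (field names are assumptions)."""
    start_s: float  # event start time in seconds
    actor: str      # e.g. "child" or "doctor"
    action: str     # short text description of the detected behavior


def events_to_script(events):
    """Render behavior events as a time-ordered script, one line per event."""
    ordered = sorted(events, key=lambda e: e.start_s)
    return "\n".join(
        f"[{e.start_s:05.1f}s] {e.actor}: {e.action}" for e in ordered
    )


def build_prompt(script, domain_knowledge, examples=()):
    """Assemble a zero-shot prompt, or a few-shot one when examples are given.

    `examples` is a sequence of (script, label) pairs; the layout below is an
    assumed format, not the prompt used in the paper.
    """
    parts = ["Clinical background:", domain_knowledge.strip(), ""]
    for ex_script, ex_label in examples:
        parts += ["Example script:", ex_script, f"Assessment: {ex_label}", ""]
    parts += ["Script to assess:", script, "Assessment and rationale:"]
    return "\n".join(parts)


events = [
    BehaviorEvent(3.5, "child", "turns head toward doctor"),
    BehaviorEvent(2.0, "doctor", "calls the child's name"),
]
script = events_to_script(events)
prompt = build_prompt(script, "Placeholder for expert-written clinical notes.")
print(prompt)
```

Passing an empty `examples` sequence gives the zero-shot form; adding labeled scripts turns the same template into a few-shot prompt, which is why the two settings can share one pipeline.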

Paper Structure

This paper contains 23 sections, 2 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Pipeline comparison of (a) the behavioral signal processing method, (b) the raw-video-based method, and (c) the proposed script-centric behavior understanding (SCBU) method.
  • Figure 2: Overview of the proposed Script-Centric Behavior Understanding (SCBU) framework. The Behavior Transcription Module converts audio-video data into behavior logs using multiple well-trained behavior signal processing models. The Script Transcription Module textualizes the behavior logs in a stream and integrates the domain prompt. Large language models are used to understand the script content and answer questions about it.
  • Figure 3: The two-stage pipeline of the Behavior Transcription Module. (1) Multi-person identification and localization locates and identifies each participant in every frame. (2) Single-person behavior perception extracts the behavioral information of each individual.
  • Figure 4: The textualization process of the Response to Name paradigm.
  • Figure 5: The emotion textualization process of the Response to Name paradigm. The blue line indicates the valence value. The red dotted line indicates the moment when the doctor calls the child's name. The red dots indicate points of emotional dynamics, and the black dashed interval highlights the segment in which the child responded.
  • ...and 7 more figures
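
The Figure 5 caption describes a valence curve, an event marker (the name call), and highlighted points of emotional dynamics. One way such a curve could be turned into text is sketched below. This is only an illustration under stated assumptions: the change threshold `min_change`, the sentence templates, and the function name `textualize_valence` are hypothetical and are not taken from the paper.

```python
def textualize_valence(times_s, valence, event_s, min_change=0.2):
    """Describe valence changes from an event onward as short text lines.

    A line is emitted whenever valence moves by at least `min_change` from
    the last reported value. The threshold rule is an assumed stand-in for
    whatever criterion marks a point of emotional dynamics in the paper.
    """
    lines = []
    last = None
    for t, v in zip(times_s, valence):
        if t < event_s:
            continue  # ignore samples before the event (e.g. the name call)
        if last is None:
            # First sample at or after the event becomes the baseline.
            last = v
            lines.append(f"Around the event ({t:.1f}s), valence is {v:+.2f}.")
            continue
        if abs(v - last) >= min_change:
            direction = "rises" if v > last else "falls"
            lines.append(
                f"{t - event_s:.1f}s after the event, "
                f"valence {direction} to {v:+.2f}."
            )
            last = v
    return lines


# Synthetic values for illustration only, not data from the paper.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
vals = [0.0, 0.0, 0.1, 0.4, 0.1]
for line in textualize_valence(times, vals, event_s=1.0):
    print(line)
```

Reporting only changes above a threshold keeps the emotional text short, so it can sit next to the behavior script without dominating the prompt.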