Copiloting Diagnosis of Autism in Real Clinical Scenarios via LLMs

Yi Jiang, Qingyang Shen, Shuzhong Lai, Shunyu Qi, Qian Zheng, Lin Yao, Yueming Wang, Gang Pan

TL;DR

ADOS-Copilot is a proposed framework that strikes a balance between scoring and explanation. The paper explores the factors that influence the performance of LLMs in this task and systematically elucidates the strengths and limitations of current LLMs from the perspective of ADOS-2.

Abstract

Autism spectrum disorder (ASD) is a pervasive developmental disorder that significantly impacts the daily functioning and social participation of individuals. Despite the abundance of research on supporting the clinical diagnosis of ASD, methods based on Large Language Models (LLMs) have not yet been explored systematically and comprehensively, particularly for real-world clinical diagnostic scenarios based on the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2). We therefore propose a framework called ADOS-Copilot, which strikes a balance between scoring and explanation, and explore the factors that influence the performance of LLMs in this task. The experimental results indicate that our proposed framework is competitive with the diagnostic results of clinicians, with a minimum MAE of 0.4643, a binary classification F1-score of 81.79%, and a ternary classification F1-score of 78.37%. Furthermore, we systematically elucidate the strengths and limitations of current LLMs in this task from the perspectives of ADOS-2, LLMs' capabilities, language, and model scale, aiming to inspire and guide the future application of LLMs to a broader range of mental health disorders. We hope for more research to be transferred into real clinical practice, opening a window of kindness to the world for eccentric children.

Paper Structure

This paper contains 56 sections, 2 equations, 12 figures, and 15 tables.

Figures (12)

  • Figure 1: Overall pipeline of our framework. First, we preprocess and transcribe the original recordings from clinical ASD diagnoses into complete long dialogue text. Then, we design prompts for In-context Enhancement and use the long dialogue text as input for LLMs to generate scores and justifications for the eight items used in ADOS clinical diagnoses: A4, A7, A8, B4, B7, B9, B10, and B11. We also feed the long dialogue text into a rule-based method to extract features, which are then integrated with the LLM results, combining the strengths of both approaches to achieve the best scoring performance. Finally, to delve deeper into the meaning behind the scores and assist clinical decision-making, we input the fused results and the original long dialogue text into LLMs to obtain more detailed textual support and additional explanatory evaluations via Interpretability Augmentation in the second stage.
  • Figure 2: Case study of our framework, generated by Qwen1.5-32b. The left part shows the Scoring & Explanation Stage, where the explanations are relatively general. The right part shows the Interpretability Augmentation Stage, which includes truncated segments from the original dialogue that support the scoring decisions made by our framework. In the output text, blue marks references to the ADOS-2-M3 scoring criteria, green marks scoring consistency, and red marks original dialogue segments that support the scoring decisions made by our framework.
  • Figure 3: Case study for the analysis of LLMs' preferences. We provide the input scoring items and the corresponding outputs generated in the first stage of our framework. The blue font in the figure indicates the basis for scoring.
  • Figure 4: In-context Enhancement prompt for ADOS-Copilot, where $Prompt_{criteria}$ is the prompt containing the clinical ADOS criteria, $Prompt_{m3}$ is the prompt describing the clinical ADOS-2 Module 3 procedures, $Prompt_{stat}$ is the prior information about ASD and TD children, and $Transcript_i$ is the preprocessed dialogue text between doctor and child.
  • Figure 5: Concise vs. standard criteria of ADOS-2. In the ablation experiment, we compare the effects of the concise criteria and the standard criteria. This figure gives specific examples of both to clarify how they differ.
  • ...and 7 more figures
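
The Figure 4 caption names four inputs to the In-context Enhancement prompt: $Prompt_{criteria}$, $Prompt_{m3}$, $Prompt_{stat}$, and $Transcript_i$. As a rough illustration only, the Python sketch below shows one way such a prompt could be assembled. The class `PromptParts`, the function `build_prompt`, the section labels, the ordering of the blocks, and the closing instruction are all hypothetical; the actual template used by ADOS-Copilot is not given in this summary. Only the eight item codes come from the Figure 1 caption.

```python
from dataclasses import dataclass

# The eight ADOS-2 items named in the Figure 1 caption.
ADOS_ITEMS = ("A4", "A7", "A8", "B4", "B7", "B9", "B10", "B11")


@dataclass(frozen=True)
class PromptParts:
    """Hypothetical container for the three fixed context blocks."""
    criteria: str  # stands in for Prompt_criteria (ADOS scoring criteria)
    m3: str        # stands in for Prompt_m3 (ADOS-2 Module 3 procedures)
    stat: str      # stands in for Prompt_stat (prior info on ASD and TD children)


def build_prompt(parts: PromptParts, transcript: str) -> str:
    """Join the context blocks and one transcript into a single prompt string.

    Assumption: the labels, block order, and final instruction below are
    illustrative choices, not the paper's actual wording.
    """
    if not transcript.strip():
        raise ValueError("transcript must not be empty")
    sections = [
        ("Scoring criteria", parts.criteria),
        ("Module 3 procedures", parts.m3),
        ("Prior information", parts.stat),
        ("Dialogue transcript", transcript),
    ]
    body = "\n\n".join(f"## {title}\n{text.strip()}" for title, text in sections)
    items = ", ".join(ADOS_ITEMS)
    return f"{body}\n\nScore each of these items and justify each score: {items}."


if __name__ == "__main__":
    demo = PromptParts(criteria="(criteria text)", m3="(procedure text)", stat="(prior text)")
    print(build_prompt(demo, "Doctor: What did you do today?\nChild: I played."))
```

Keeping the three fixed blocks in one immutable object and passing the transcript separately reflects the caption's indexing: only $Transcript_i$ varies per child, so the same `PromptParts` instance can be reused across all transcripts.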