Table of Contents
Fetching ...

Exploring In-Context Learning Capabilities of ChatGPT for Pathological Speech Detection

Mahdi Amiri, Hatef Otroshi Shahreza, Ina Kodrasi

TL;DR

This paper addresses the interpretability gap in automatic pathological speech detection by leveraging multimodal LLMs. It investigates ChatGPT-4o in a few-shot in-context-learning setup using STFT magnitude spectrograms to detect dysarthria, producing not only classifications but also explanations. Evaluations on the Noise Reduced UA-Speech Dysarthria dataset show competitive performance relative to a SOTA CNN baseline, with the added advantage of interpretability through generated explanations. The study also conducts ablations on system prompts and input modalities, highlighting the potential and current limitations of multimodal LLMs for clinically relevant, explainable speech pathology detection and outlining directions for future improvements in explanation quality and prompt design.

Abstract

Automatic pathological speech detection approaches have shown promising results, gaining attention as potential diagnostic tools alongside costly traditional methods. While these approaches can achieve high accuracy, their lack of interpretability limits their applicability in clinical practice. In this paper, we investigate the use of multimodal Large Language Models (LLMs), specifically ChatGPT-4o, for automatic pathological speech detection in a few-shot in-context learning setting. Experimental results show that this approach not only delivers promising performance but also provides explanations for its decisions, enhancing model interpretability. To further understand its effectiveness, we conduct an ablation study to analyze the impact of different factors, such as input type and system prompts, on the final results. Our findings highlight the potential of multimodal LLMs for further exploration and advancement in automatic pathological speech detection.

Exploring In-Context Learning Capabilities of ChatGPT for Pathological Speech Detection

TL;DR

This paper addresses the interpretability gap in automatic pathological speech detection by leveraging multimodal LLMs. It investigates ChatGPT-4o in a few-shot in-context-learning setup using STFT magnitude spectrograms to detect dysarthria, producing not only classifications but also explanations. Evaluations on the Noise Reduced UA-Speech Dysarthria dataset show competitive performance relative to a SOTA CNN baseline, with the added advantage of interpretability through generated explanations. The study also conducts ablations on system prompts and input modalities, highlighting the potential and current limitations of multimodal LLMs for clinically relevant, explainable speech pathology detection and outlining directions for future improvements in explanation quality and prompt design.

Abstract

Automatic pathological speech detection approaches have shown promising results, gaining attention as potential diagnostic tools alongside costly traditional methods. While these approaches can achieve high accuracy, their lack of interpretability limits their applicability in clinical practice. In this paper, we investigate the use of multimodal Large Language Models (LLMs), specifically ChatGPT-4o, for automatic pathological speech detection in a few-shot in-context learning setting. Experimental results show that this approach not only delivers promising performance but also provides explanations for its decisions, enhancing model interpretability. To further understand its effectiveness, we conduct an ablation study to analyze the impact of different factors, such as input type and system prompts, on the final results. Our findings highlight the potential of multimodal LLMs for further exploration and advancement in automatic pathological speech detection.

Paper Structure

This paper contains 16 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Schematic illustration of the proposed method. We set a system prompt that describes the classification task, input representation, and the number of reference samples per class. Then, several samples from each class are provided to the model, which is asked to classify the test sample based on them. In response, the model returns a classification score and explains the reasoning behind its decision.