NeuroXVocal: Detection and Explanation of Alzheimer's Disease through Non-invasive Analysis of Picture-prompted Speech

Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Magda Tsolaki, Vasileios Argyriou, Panagiotis Sarigiannidis

TL;DR

NeuroXVocal tackles early, non-invasive Alzheimer's Disease diagnosis from picture-prompted speech by coupling a multimodal transformer-based classifier with a retrieval-augmented, literature-grounded explainer. The Neuro classifier fuses acoustic features, Wav2Vec2 embeddings, and Whisper-DeBERTa textual features through a two-layer transformer to achieve state-of-the-art accuracy on the ADReSSo benchmark (95.77% on held-out test) and strong cross-validation performance. The XVocal component uses a RAG approach with dense retrieval from a domain knowledge base and FLAN-T5 to generate evidence-based explanations of detected markers, validated by clinicians. Together, NeuroXVocal offers high-accuracy AD detection plus clinically actionable, literature-backed explanations, with plans for real-time deployment and continuous knowledge updates to support clinical decision-making.
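
As a rough illustration of the fusion idea summarized above, the sketch below projects three modality vectors to a shared width, treats them as three tokens, and applies two rounds of self-attention with residual connections before mean pooling. It is not the authors' architecture: the dimensions, the random projection weights, the identity query/key/value maps, and the names `make_projection`, `self_attention`, and `fuse` are all assumptions chosen to keep the example dependency-free.

```python
import math
import random


def make_projection(n_in, n_out, rng):
    """Hypothetical random projection matrix with n_in rows and n_out columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]


def linear(x, w):
    """Multiply vector x (length n_in) by matrix w (n_in x n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the tokens
    themselves, i.e. identity projections and a single head.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fuse(acoustic, audio_emb, text_emb, projections, num_layers=2):
    """Fuse three modality vectors into one pooled representation."""
    # One token per modality, each projected to the shared width.
    tokens = [linear(x, w) for x, w in zip((acoustic, audio_emb, text_emb), projections)]
    for _ in range(num_layers):
        attended = self_attention(tokens)
        # Residual connection: add the attended values back onto each token.
        tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    d = len(tokens[0])
    # Mean-pool the three modality tokens into a single vector.
    return [sum(tok[j] for tok in tokens) / len(tokens) for j in range(d)]


rng = random.Random(0)
input_dims = (4, 6, 5)   # toy sizes for acoustic, audio-embedding, text features
shared_dim = 8           # toy shared width
projections = [make_projection(n, shared_dim, rng) for n in input_dims]
fused = fuse([0.1] * 4, [0.2] * 6, [0.3] * 5, projections)
```

In a trained model the pooled vector would feed a classification head. Here it only shows how each modality token can attend to the other two.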

Abstract

The early diagnosis of Alzheimer's Disease (AD) through non-invasive methods remains a significant healthcare challenge. We present NeuroXVocal, a novel dual-component system that not only classifies but also explains potential AD cases through speech analysis. The classification component (Neuro) processes three distinct data streams: acoustic features capturing speech patterns and voice characteristics, textual features extracted from speech transcriptions, and precomputed embeddings representing linguistic patterns. These streams are fused through a custom transformer-based architecture that enables robust cross-modal interactions. The explainability component (XVocal) implements a Retrieval-Augmented Generation (RAG) approach, leveraging Large Language Models combined with a domain-specific knowledge base of AD research literature. This architecture enables XVocal to retrieve relevant clinical studies and research findings and to generate evidence-based, context-sensitive explanations of the acoustic and linguistic markers identified in patient speech. Using the IS2021 ADReSSo Challenge benchmark dataset, our system achieved state-of-the-art performance with 95.77% accuracy in AD classification, significantly outperforming previous approaches. The explainability component was qualitatively evaluated using a structured questionnaire completed by medical professionals, validating its clinical relevance. NeuroXVocal's unique combination of high-accuracy classification and interpretable, literature-grounded explanations demonstrates its potential as a practical tool for supporting clinical AD diagnosis.
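
The retrieval step of a RAG pipeline like the one described above can be sketched as follows: passages in a knowledge base are ranked by cosine similarity to a query vector, and the top matches are placed in a prompt for the generator. This is a minimal sketch under stated assumptions, not the paper's implementation. The three-dimensional embeddings, the placeholder passages, the prompt wording, and the names `retrieve` and `build_prompt` are all hypothetical, and no language model is called.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, knowledge_base, k=2):
    """Return the k entries whose embeddings are most similar to query_vec."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(query_vec, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(markers, passages):
    """Assemble a generator prompt from detected markers and retrieved passages.

    The wording is an illustrative assumption, not the prompt used in the paper.
    """
    evidence = "\n".join(f"- {p['text']}" for p in passages)
    return (
        f"Detected speech markers: {', '.join(markers)}\n"
        f"Relevant literature:\n{evidence}\n"
        "Explain how the markers relate to the literature."
    )


# Placeholder knowledge base: the texts and toy embeddings are invented for the demo.
knowledge_base = [
    {"id": "p1", "text": "Placeholder passage about pauses in speech.", "embedding": [1.0, 0.0, 0.0]},
    {"id": "p2", "text": "Placeholder passage about vocabulary richness.", "embedding": [0.0, 1.0, 0.0]},
    {"id": "p3", "text": "Placeholder passage about pitch variation.", "embedding": [0.0, 0.0, 1.0]},
]

query_vec = [0.9, 0.1, 0.0]  # toy query embedding, closest to the "pauses" passage
top_passages = retrieve(query_vec, knowledge_base, k=2)
prompt = build_prompt(["frequent pauses", "reduced vocabulary"], top_passages)
```

In a full system the embeddings would come from a dense encoder and the prompt would be passed to the generator (FLAN-T5, according to the TL;DR). Both are left out here so that the ranking logic stays visible.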

Paper Structure

This paper contains 11 sections, 14 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: NeuroXVocal Architecture