Dementia Insights: A Context-Based MultiModal Approach

Sahar Sinene Mehdoui, Abdelhamid Bouzid, Daniel Sierra-Sosa, Adel Elmaghraby

TL;DR

The paper addresses non-invasive dementia detection by integrating text and audio data in a context-based multimodal framework that uses GPT/BERT/CLIP for text and CLAP for audio. It introduces a context-based In-Context Learning (ICL) approach and compares it against a context-aware multimodal model on the Pitt Corpus Cookie Theft data. GPT+CLAP fusion achieves an F1-score of 83.33%, which the authors report as surpassing state-of-the-art models, and raw transcripts often outperform expert-annotated data, supporting scalable, label-efficient screening. The work demonstrates the value of context conditioning and cross-attention in multimodal cognitive health assessment and points toward robust, scalable dementia diagnostics and personalized insights.

Abstract

Dementia, a progressive neurodegenerative disorder, affects memory, reasoning, and daily functioning, creating challenges for individuals and healthcare systems. Early detection is crucial for timely interventions that may slow disease progression. Large pre-trained models (LPMs) for text and audio, such as Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), and Contrastive Language-Audio Pretraining (CLAP), have shown promise in identifying cognitive impairments. However, existing studies rely heavily on expert-annotated datasets and unimodal approaches, limiting robustness and scalability. This study proposes a context-based multimodal method, integrating both text and audio data using the best-performing LPMs in each modality. By incorporating contextual embeddings, our method improves dementia detection performance. Additionally, motivated by the effectiveness of contextual embeddings, we further experimented with a context-based In-Context Learning (ICL) approach as a complementary technique. Results show that GPT-based embeddings, particularly when fused with CLAP audio features, achieve an F1-score of $83.33\%$, surpassing state-of-the-art dementia detection models. Furthermore, raw text data outperforms expert-annotated datasets, demonstrating that LPMs can extract meaningful linguistic and acoustic patterns without extensive manual labeling. These findings highlight the potential for scalable, non-invasive diagnostic tools that reduce reliance on costly annotations while maintaining high accuracy. By integrating multimodal learning with contextual embeddings, this work lays the foundation for future advancements in personalized dementia detection and cognitive health research.
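To make the idea of fusing text embeddings with audio features through cross-attention more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`cross_attention`, `fuse`, `mean_pool`), the random stand-in embeddings and projection weights, the tiny dimensions, the choice of text as the query modality, and the mean-pool-then-concatenate fusion are all assumptions made for illustration. In practice the inputs would come from pre-trained encoders and the projections would be learned.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(rows):
    """Average a sequence of vectors into a single vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (n_q, d_q) features of the attending modality (assumed: text)
    context: (n_c, d_c) features of the attended modality (assumed: audio)
    """
    q = matmul(queries, w_q)                 # (n_q, d)
    k = matmul(context, w_k)                 # (n_c, d)
    v = matmul(context, w_v)                 # (n_c, d)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]     # transpose of k: (d, n_c)
    scores = matmul(q, k_t)                  # (n_q, n_c)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights       # attended: (n_q, d)


def fuse(text_emb, audio_emb, w_q, w_k, w_v):
    """Hypothetical fusion: pool text, pool audio-attended text, concatenate."""
    attended, weights = cross_attention(text_emb, audio_emb, w_q, w_k, w_v)
    return mean_pool(text_emb) + mean_pool(attended), weights


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
D_TEXT, D_AUDIO, D_ATTN = 4, 3, 2            # illustrative sizes only
text_emb = rand_matrix(3, D_TEXT)            # stand-in for 3 text-token embeddings
audio_emb = rand_matrix(5, D_AUDIO)          # stand-in for 5 audio-frame embeddings
w_q = rand_matrix(D_TEXT, D_ATTN)
w_k = rand_matrix(D_AUDIO, D_ATTN)
w_v = rand_matrix(D_AUDIO, D_ATTN)

fused, weights = fuse(text_emb, audio_emb, w_q, w_k, w_v)
print(len(fused), len(weights), len(weights[0]))
```

In this sketch each text position produces a probability distribution over audio frames, so every row of `weights` sums to one. The fused vector would then be passed to a classifier head, which is omitted here.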

Paper Structure

This paper contains 24 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: The Standardized 'Cookie Theft' picture for assessing cognitive abilities through verbal description. Source: Boston Diagnostic Aphasia Examination, available under https://creativecommons.org/licenses/by/4.0/. Original image retrieved from cookie_theft_image. No changes were made to the original image.
  • Figure 2: In-Context Learning (ICL) Model Architecture for Dementia Classification Using Structured Prompts with Large Language Models.
  • Figure 3: Schematic Representation of the Multimodal Model Integrating Text and Audio Features.
  • Figure 4: Illustration of Cross-Attention Mechanism within the Multimodal Classification Framework.
  • Figure 5: Confusion matrix for one of the 10 folds.