CogniAlign: Word-Level Multimodal Speech Alignment with Gated Cross-Attention for Alzheimer's Detection

David Ortiz-Perez, Manuel Benavent-Lledo, Javier Rodriguez-Juan, Jose Garcia-Rodriguez, David Tomás

TL;DR

The paper addresses early Alzheimer’s detection using non-intrusive audio and text data. It introduces CogniAlign, a word-level alignment framework that uses Whisper-derived timestamps and a Gated Cross-Attention Transformer to fuse the two modalities. Its core equations are $\mathbf{H}_{\text{att}} = \mathrm{Attention}(\mathbf{A}, \mathbf{T}, \mathbf{T})$ and $\mathbf{H} = \mathbf{G} \odot \mathbf{H}_{\text{att}} + (1-\mathbf{G}) \odot \mathbf{A}$. On ADReSSo, the model reports state-of-the-art accuracy of 87.35% under LOSO and 90.36% under 5-fold CV, plus a regression RMSE of 5.28 (LOSO) and 4.77 (5-fold CV). The work highlights the importance of word-level temporal alignment and prosodic cues for robust, non-invasive cognitive health monitoring, and points toward future multimodal expansions with weighted token fusion.
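The two equations above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, and because the summary does not define how $\mathbf{G}$ is computed, the sigmoid gate and its `gate_weight` and `gate_bias` parameters are hypothetical placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, text, gate_weight=1.0, gate_bias=0.0):
    """H = G * H_att + (1 - G) * A, with H_att = Attention(A, T, T).

    The gate below is an assumed, element-wise sigmoid; the source does not
    specify its parameterization.
    """
    h_att = attention(audio, text, text)  # audio rows are the queries
    fused = []
    for a_row, h_row in zip(audio, h_att):
        row = []
        for a, h in zip(a_row, h_row):
            g = sigmoid(gate_weight * (a + h) + gate_bias)  # hypothetical gate
            row.append(g * h + (1.0 - g) * a)
        fused.append(row)
    return fused


# Toy usage: two audio vectors attend over three text vectors.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = gated_fusion(audio, text)
```

Because the gate lies in (0, 1), each fused value is a convex combination of the attended value and the original audio value: a gate near 0 passes the audio embedding through unchanged, and a gate near 1 replaces it with the text-informed representation.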

Abstract

Early detection of cognitive disorders such as Alzheimer's disease is critical for enabling timely clinical intervention and improving patient outcomes. In this work, we introduce CogniAlign, a multimodal architecture for Alzheimer's detection that integrates audio and textual modalities, two non-intrusive sources of information that offer complementary insights into cognitive health. Unlike prior approaches that fuse modalities at a coarse level, CogniAlign leverages a word-level temporal alignment strategy that synchronizes audio embeddings with corresponding textual tokens based on transcription timestamps. This alignment supports the development of token-level fusion techniques, enabling more precise cross-modal interactions. To fully exploit this alignment, we propose a Gated Cross-Attention Fusion mechanism, where audio features attend over textual representations, guided by the superior unimodal performance of the text modality. In addition, we incorporate prosodic cues, specifically interword pauses, by inserting pause tokens into the text and generating audio embeddings for silent intervals, further enriching both streams. We evaluate CogniAlign on the ADReSSo dataset, where it achieves an accuracy of 87.35% in a Leave-One-Subject-Out setup and 90.36% in 5-fold Cross-Validation, outperforming existing state-of-the-art methods. A detailed ablation study confirms the advantages of our alignment strategy, attention-based fusion, and prosodic modeling. Finally, we perform a corpus analysis to assess the impact of the proposed prosodic features and apply Integrated Gradients to identify the most influential input segments used by the model in predicting cognitive health outcomes.
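The word-level alignment described in the abstract can be illustrated with a short sketch. It assumes that the audio encoder emits one embedding per fixed-length frame and that the frames falling inside a word's timestamp interval are mean-pooled into a single vector. The abstract does not state the pooling method or the frame rate, so `align_audio_to_words`, `frame_rate`, and the zero-vector fallback are hypothetical choices made for illustration.

```python
import math


def align_audio_to_words(frames, words, frame_rate=50.0):
    """Pool frame-level audio embeddings into one vector per word.

    frames:     list of equal-length float vectors, one per audio frame
    words:      list of (token, start_seconds, end_seconds) tuples
    frame_rate: frames per second (assumed value, not taken from the paper)
    """
    dim = len(frames[0])
    aligned = []
    for token, start, end in words:
        first = int(start * frame_rate)
        # Always cover at least one frame, even for very short words.
        last = max(first + 1, int(math.ceil(end * frame_rate)))
        span = frames[first:min(last, len(frames))]
        if span:
            pooled = [sum(frame[j] for frame in span) / len(span) for j in range(dim)]
        else:
            # Timestamp lies beyond the available audio: fall back to zeros.
            pooled = [0.0] * dim
        aligned.append((token, pooled))
    return aligned


# Toy usage: four frames at 4 frames per second, two half-second words.
frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
words = [("the", 0.0, 0.5), ("boy", 0.5, 1.0)]
aligned = align_audio_to_words(frames, words, frame_rate=4.0)
```

The result is one audio vector per textual token, which is the property that token-level fusion relies on: both streams have the same sequence length and the same ordering.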

Paper Structure

This paper contains 20 sections, 6 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the CogniAlign architecture. Audio recordings from the ADReSSo dataset are transcribed with Whisper, extracting word-level timestamps and prosodic cues (pauses). This enables temporal alignment between textual and audio embeddings at the word level. Aligned features are fused through a Gated Cross-Attention Transformer Encoder (TE), shown on the right of the figure, where textual embeddings serve as queries (Q) and audio embeddings as keys/values (K/V). A learnable gating mechanism regulates the integration of attended features. An MLP processes fused representations for Alzheimer's detection. Green blocks represent text components, blue blocks represent audio components, and purple blocks denote multimodal fusion. Frozen pre-trained models are indicated with a snowflake symbol. Better viewed in color.
  • Figure 2: Prosodic augmentation pipeline. Pauses, detected using Whisper word-level timestamps (and also visible as silent regions in the waveform), are inserted into the transcription as punctuation marks (comma, period, ellipsis) based on duration. Inserted pauses, shown in green, enhance the original transcription with prosodic cues.
  • Figure 3: Transformer-based fusion strategies explored in this work: (a) Concatenation, (b) Element-wise Fusion (e.g., Sum, Product, or Mean), (c) Self-Attention Fusion, (d) Cross-Attention Fusion (including gated variant), and (e) Bidirectional Cross-Attention Fusion (including gated variant). Blue blocks represent one input modality, green blocks represent the other modality, and purple blocks correspond to fused multimodal representations. Better viewed in color.
  • Figure 4: Integrated Gradients attribution visualization from the CogniAlign model. Tokens contributing positively to the predicted class are shown in green, while those reducing its likelihood are shown in red. Color intensity corresponds to the relative importance of each token.
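The prosodic augmentation summarized in the Figure 2 caption (pauses between words rendered as a comma, period, or ellipsis depending on duration) can be sketched as follows. The caption gives no duration thresholds, so the `short`, `medium`, and `long` values below are hypothetical placeholders, as is the function name `insert_pause_tokens`. Each inserted mark keeps the silent interval it covers, so that an audio embedding could also be produced for that span.

```python
def insert_pause_tokens(words, short=0.5, medium=1.0, long=2.0):
    """Insert punctuation tokens for inter-word pauses.

    words: list of (token, start_seconds, end_seconds) tuples in spoken order
    short, medium, long: pause thresholds in seconds (assumed values)

    Returns a new list of (token, start, end) tuples in which each pause of
    at least `short` seconds appears as ",", ".", or "..." spanning the gap.
    """
    augmented = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long:
                augmented.append(("...", prev_end, start))
            elif gap >= medium:
                augmented.append((".", prev_end, start))
            elif gap >= short:
                augmented.append((",", prev_end, start))
        augmented.append((token, start, end))
        prev_end = end
    return augmented


# Toy usage with hand-written timestamps (not taken from ADReSSo).
words = [
    ("the", 0.0, 0.25),
    ("boy", 0.25, 0.5),
    ("is", 1.0, 1.25),
    ("taking", 2.5, 2.75),
    ("cookies", 5.0, 5.5),
]
augmented = insert_pause_tokens(words)
tokens_only = [token for token, _, _ in augmented]
```

Reusing ordinary punctuation as pause tokens means that a frozen pre-trained text encoder already has embeddings for them, so no vocabulary changes are needed in this sketch.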