MC-ViViT: Multi-branch Classifier-ViViT to detect Mild Cognitive Impairment in older adults using facial videos

Jian Sun, Hiroko H. Dodge, Mohammad H. Mahoor

TL;DR

MC-ViViT introduces a Transformer-based framework for MCI detection from facial videos by integrating a ViViT backbone with a Multi-branch Classifier and a novel HP Loss to mitigate inter- and intra-class imbalances in the I-CONECT dataset. The method leverages tubelet embeddings and a Factorised Encoder to extract spatio-temporal features, while the MC module enriches representations and HP Loss balances hard samples and batch correlations. Ablation studies demonstrate that both the MC and HP Loss contribute meaningfully to accuracy, and cross-theme experiments show reasonable generalization across interview themes. This video-based, non-invasive approach offers a scalable alternative to MRI- and questionnaire-based diagnostics, with practical implications for low-cost cognitive health screening and monitoring.
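As a rough illustration of what tubelet embedding and a Factorised Encoder imply for sequence lengths, the sketch below counts the non-overlapping tubelets in a clip and splits that count into the per-time-step spatial sequence and the temporal sequence. The function name `tubelet_grid` and the clip and tubelet sizes in the usage lines are hypothetical and are not taken from the paper; class tokens and the linear projection are left out.

```python
def tubelet_grid(T, H, W, t, h, w, channels=3):
    """Count non-overlapping tubelets in a clip of T frames of size H x W.

    Each tubelet spans t frames and an h x w pixel patch. All sizes here
    are illustrative assumptions, not settings reported in the paper.
    """
    if min(T, H, W, t, h, w, channels) <= 0:
        raise ValueError("all sizes must be positive")
    # Tubelets that do not fit completely are dropped (floor division).
    n_t, n_h, n_w = T // t, H // h, W // w
    return {
        "grid": (n_t, n_h, n_w),
        # Total tokens before any class token is prepended.
        "num_tokens": n_t * n_h * n_w,
        # Raw values flattened from one tubelet before linear projection.
        "values_per_tubelet": t * h * w * channels,
        # A factorised encoder attends over space within each temporal
        # index first, then over the temporal indices.
        "spatial_tokens_per_step": n_h * n_w,
        "temporal_steps": n_t,
    }


# Hypothetical clip: 16 frames of 224 x 224 RGB, tubelets of 2 x 16 x 16.
info = tubelet_grid(16, 224, 224, 2, 16, 16)
print(info["grid"], info["num_tokens"], info["values_per_tubelet"])
```

With these assumed sizes, the spatial stage would see 196 tokens per temporal index and the temporal stage 8 steps, instead of a single joint sequence of 1568 tokens.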

Abstract

Deep machine learning models, including Convolutional Neural Networks (CNNs), have been successful in detecting Mild Cognitive Impairment (MCI) from medical images, questionnaires, and videos. This paper proposes a novel Multi-branch Classifier-Video Vision Transformer (MC-ViViT) model that distinguishes people with MCI from those with normal cognition by analyzing facial features. The data come from I-CONECT, a behavioral intervention trial aimed at improving cognitive function through frequent video chats. MC-ViViT extracts spatiotemporal features of videos in one branch and augments the representations with the MC module. The I-CONECT dataset is challenging because it is imbalanced, containing Hard-Easy and Positive-Negative samples, which impedes the performance of MC-ViViT. To address this imbalance, we propose a loss function for Hard-Easy and Positive-Negative Samples (HP Loss) that combines Focal loss and AD-CORRE loss. Our experimental results on the I-CONECT dataset show the great potential of MC-ViViT in predicting MCI, with an accuracy of 90.63% on some of the interview videos.
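To make the Hard-Easy part of HP Loss more concrete, here is a minimal sketch of the standard binary focal loss, which is one of its two ingredients. The defaults `gamma=2.0` and `alpha=0.25` are the commonly cited values from the original focal loss formulation, not settings reported for MC-ViViT, and treating MCI as the positive class (`y = 1`) is an assumption. The AD-CORRE term and the way the two losses are weighted are not specified in this section, so they are not sketched.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class, in [0, 1].
    y: ground-truth label, 1 (positive) or 0 (negative).
    gamma, alpha: assumed defaults, not values taken from the paper.
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    # (1 - p_t)^gamma shrinks the loss of easy, well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch given as plain Python lists."""
    losses = [binary_focal_loss(p, y, gamma, alpha)
              for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.3).
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.3, 1)
print(easy < hard)
```

Setting `gamma=0` reduces the expression to class-weighted cross-entropy, which is a quick way to check that the modulating factor is the only thing that distinguishes hard from easy samples.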

Paper Structure (27 sections, 8 equations, 7 figures, 7 tables)

This paper contains 27 sections, 8 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The structure of the proposed MC-ViViT. D is the model depth, i.e., the number of layers in MC-ViViT, counted as the total number of Spatial Transformers and Temporal Transformers. Layer Norm is Layer Normalization.
  • Figure 2: The structure of Tubelet Embedding.
  • Figure 3: The structure of the Factorised Encoder (FE). CLS is the class token. The green and purple capsules are positional embeddings. The remaining capsules are tubelet embeddings. MC is the Multi-branch Classifier.
  • Figure 4: The structure of MC. $\oplus$ represents concatenation.
  • Figure 5: Two sample frames from the video dataset. In (a), the interviewee's window is bigger than the interviewer's because the interviewee was speaking. Conversely, in (b), the interviewer was talking, so her window is bigger than the interviewee's.
  • ...and 2 more figures