MC-ViViT: Multi-branch Classifier-ViViT to detect Mild Cognitive Impairment in older adults using facial videos
Jian Sun, Hiroko H. Dodge, Mohammad H. Mahoor
TL;DR
MC-ViViT is a Transformer-based framework for MCI detection from facial videos. It integrates a ViViT backbone with a Multi-branch Classifier (MC) and a novel HP Loss to mitigate the inter- and intra-class imbalances in the I-CONECT dataset. The method uses tubelet embeddings and a Factorised Encoder to extract spatio-temporal features, the MC module enriches the resulting representations, and HP Loss combines Focal loss and AD-CORRE loss to address the Hard-Easy and Positive-Negative sample imbalance. Ablation studies indicate that both the MC module and HP Loss contribute to accuracy, and cross-theme experiments show reasonable generalization across interview themes. This video-based, non-invasive approach offers a scalable alternative to MRI- and questionnaire-based diagnostics, with practical implications for low-cost cognitive health screening and monitoring.
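The tubelet embedding step mentioned above can be illustrated with a minimal sketch. It assumes a single video stored as a NumPy array of shape (frames, height, width, channels) and cuts it into non-overlapping tubelets that are flattened into token vectors. The function name `extract_tubelets` and the tubelet sizes are hypothetical, and the learned linear projection that normally follows is omitted; this is not the authors' implementation.

```python
import numpy as np

def extract_tubelets(video, t, h, w):
    """Split a (T, H, W, C) video into non-overlapping t x h x w tubelets.

    Returns an array of shape (num_tubelets, t * h * w * C), one flattened
    token per tubelet. The sizes t, h, w are illustrative assumptions.
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    nt, nh, nw = T // t, H // h, W // w
    # Separate each axis into (number of tubelets, size within a tubelet).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Bring the tubelet-index axes to the front, the within-tubelet axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten: one row per tubelet.
    return x.reshape(nt * nh * nw, t * h * w * C)
```

In a full model, each row would then be passed through a learned linear layer and given a positional embedding before entering the encoder.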
Abstract
Deep machine learning models, including Convolutional Neural Networks (CNNs), have been successful in detecting Mild Cognitive Impairment (MCI) from medical images, questionnaires, and videos. This paper proposes a novel Multi-branch Classifier-Video Vision Transformer (MC-ViViT) model that distinguishes individuals with MCI from those with normal cognition by analyzing facial features. The data come from I-CONECT, a behavioral intervention trial aimed at improving cognitive function by providing frequent video chats. MC-ViViT extracts spatiotemporal features of videos in one branch and augments the representations with the MC module. The I-CONECT dataset is challenging because it is imbalanced, containing both Hard-Easy and Positive-Negative samples, which impedes the performance of MC-ViViT. To address this imbalance problem, we propose a loss function for Hard-Easy and Positive-Negative Samples (HP Loss) that combines Focal loss and AD-CORRE loss. Our experimental results on the I-CONECT dataset show the great potential of MC-ViViT in predicting MCI, with an accuracy of 90.63% on some of the interview videos.
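To make the Hard-Easy part of HP Loss more concrete, the following is a minimal sketch of the standard binary Focal loss, which down-weights well-classified (easy) samples. The values `gamma=2.0` and `alpha=0.25` are common defaults assumed here, not settings taken from this paper, and the AD-CORRE term and the way the two losses are combined are not reproduced.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs  : predicted probability of the positive class, shape (N,)
    labels : ground-truth labels in {0, 1}, shape (N,)
    gamma, alpha : assumed defaults, not values reported by the paper.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    # alpha_t weights positive and negative samples differently.
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # The (1 - p_t)^gamma factor shrinks the loss of easy samples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to sanity-check the implementation.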
