Detection of Mild Cognitive Impairment Using Facial Features in Video Conversations
Muath Alsuhaibani, Hiroko H. Dodge, Mohammad H. Mahoor
TL;DR
This work addresses early, non-invasive detection of Mild Cognitive Impairment (MCI) in community settings with a two-stage pipeline: a convolutional autoencoder (CAE) produces a $128$-dimensional latent facial representation, and a transformer models its temporal structure. Video frames are downsampled to $10$ fps, and segment and sequence information is encoded through positional embeddings $P = P_M + P_S + P_p$ that feed a four-layer transformer with a classification token. The best configuration reaches about $88\%$ accuracy and $AUC\approx0.87$ on the I-CONECT dataset, outperforming non-temporal baselines and remaining competitive with prior modality-based methods. The approach points to a scalable, non-invasive screening path for MCI in real-world home settings; future work includes automated video quality assessment and multimodal fusion with speech data.
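As a rough illustration of two preprocessing steps named above, the sketch below picks the frame indices to keep when reducing a video to $10$ fps and forms the summed embedding $P = P_M + P_S + P_p$ elementwise. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `downsample_indices` and `sum_positional` are hypothetical, the floor-based frame selection rule is assumed, and the three embedding terms are treated simply as equally shaped arrays because this section does not define them individually.

```python
from typing import List

Matrix = List[List[float]]


def downsample_indices(n_frames: int, src_fps: float, dst_fps: float = 10.0) -> List[int]:
    """Return indices of frames to keep when reducing src_fps to dst_fps.

    Assumption: frames are picked at evenly spaced times and the index
    is floored; the paper may use a different selection rule.
    """
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    if dst_fps >= src_fps:
        # Nothing to drop: the source is already at or below the target rate.
        return list(range(n_frames))
    step = src_fps / dst_fps
    indices: List[int] = []
    k = 0
    while int(k * step) < n_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def sum_positional(p_m: Matrix, p_s: Matrix, p_p: Matrix) -> Matrix:
    """Elementwise sum P = P_M + P_S + P_p of three equally shaped matrices.

    Each matrix holds one embedding vector per token (rows) with the
    same embedding width (columns).
    """
    if not (len(p_m) == len(p_s) == len(p_p)):
        raise ValueError("embedding matrices must have the same number of rows")
    return [
        [a + b + c for a, b, c in zip(row_m, row_s, row_p)]
        for row_m, row_s, row_p in zip(p_m, p_s, p_p)
    ]
```

For example, a one-second clip recorded at 30 fps would keep every third frame, while a 25 fps clip would keep frames at an alternating stride of two and three; in both cases ten frames remain per second.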
Abstract
Early detection of Mild Cognitive Impairment (MCI) enables early interventions that can slow the progression from MCI to dementia. Deep Learning (DL) algorithms could help achieve early, non-invasive, low-cost detection of MCI. This paper presents the detection of MCI in older adults using DL models based only on facial features extracted from video-recorded conversations at home. We used data collected in the I-CONECT behavioral intervention study (NCT02871921), in which several sessions of semi-structured interviews between socially isolated older individuals and interviewers were video recorded. We developed a framework that extracts holistic spatial facial features using a convolutional autoencoder and temporal information using transformers. The proposed DL model was able to distinguish I-CONECT study participants with MCI from those with normal cognition (NC) using facial features. Incorporating the segment and sequence information of the facial features improved prediction performance compared with non-temporal features. Detection accuracy with this combined method reached 88%, compared with 84% without the segment and sequence information of the facial features within a video on a given theme.
