AI-Driven Early Mental Health Screening: Analyzing Selfies of Pregnant Women

Gustavo A. Basílio, Thiago B. Pereira, Alessandro L. Koerich, Hermano Tavares, Ludmila Dias, Maria das Graças da S. Teixeira, Rafael T. Sousa, Wilian H. Hisatugu, Amanda S. Mota, Anilton S. Garcia, Marco Aurélio K. Galletta, Thiago M. Paixão

TL;DR

This study investigates AI-driven screening for depression and anxiety in high-risk pregnant patients from front-facing selfies, with labels derived from PHQ-4 responses. It compares a conventional CNN-based transfer-learning pipeline with a novel VLM-based pipeline in which VLMs generate textual descriptions of each face, Sentence-BERT embeds those descriptions, and a lightweight FFNN classifies the embeddings; a zero-shot baseline using GPT-4o is also included. On a dataset of 108 participants and 147 selfies, evaluated with Leave-One-Subject-Out cross-validation, the VLM-based approach substantially outperforms the CNNs, achieving an accuracy of 77.6% and an F1-score of 56.0%, with GPT-4o delivering the best overall performance among the tested models. The work demonstrates both the potential and the limitations of selfie-based AI screening for maternal mental health, and points to practical next steps such as larger, multi-shot datasets and multi-modal fusion to improve reliability and clinical utility.
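Leave-One-Subject-Out cross-validation matters here because some participants contributed more than one selfie: all images from a subject must be held out together, or the model could be tested on a face it has already seen. The following is a minimal sketch of that splitting logic in plain Python, not the authors' implementation; the function name `leave_one_subject_out` and the toy subject IDs are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subject_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    # Group sample indices by the subject who contributed them.
    by_subject: Dict[Hashable, List[int]] = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    # Each fold holds out every sample of exactly one subject.
    for subject, test_indices in by_subject.items():
        held_out = set(test_indices)
        train_indices = [i for i in range(len(subject_ids)) if i not in held_out]
        yield subject, train_indices, test_indices


# Hypothetical toy data: subject "s2" contributed two selfies.
toy_subjects = ["s1", "s2", "s2", "s3"]
folds = list(leave_one_subject_out(toy_subjects))
for subject, train_idx, test_idx in folds:
    print(subject, train_idx, test_idx)
```

With 108 participants, this scheme would produce 108 folds, and the per-fold predictions can then be pooled over all 147 selfies before computing metrics.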

Abstract

Major Depressive Disorder and anxiety disorders affect millions globally, contributing significantly to the burden of mental health issues. Early screening is crucial for effective intervention, as timely identification can significantly improve treatment outcomes. Artificial intelligence (AI) can support such screening by analyzing multiple data sources, including facial features in digital images. However, existing methods often rely on controlled environments or specialized equipment, limiting their broad applicability. This study explores the potential of AI models for ubiquitous depression-anxiety screening from face-centric selfies. The investigation focuses on high-risk pregnant patients, a population that is particularly vulnerable to mental health issues. To cope with the limited training data resulting from our clinical setup, pre-trained models were used in two different approaches: fine-tuning convolutional neural networks (CNNs) originally designed for facial expression recognition, and employing vision-language models (VLMs) for zero-shot analysis of facial expressions. Experimental results indicate that the proposed VLM-based method significantly outperforms CNNs, achieving an accuracy of 77.6%. Although there is significant room for improvement, the results suggest that VLMs can be a promising approach for mental health screening.
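The reported accuracy of 77.6% is best read alongside the F1-score of 56.0% given in the TL;DR, because the dataset is skewed towards negative samples. The sketch below, using hypothetical toy labels rather than the study's data, shows how accuracy can stay high on imbalanced data while F1 on the positive class collapses. The helper name `accuracy_and_f1` is an illustrative assumption.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1 of the positive class) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives at all.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical imbalanced toy set: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# A degenerate classifier that always predicts "negative".
acc_trivial, f1_trivial = accuracy_and_f1(y_true, [0] * 10)

# A classifier that finds one of the two positives and raises no false alarms.
acc_partial, f1_partial = accuracy_and_f1(y_true, [0] * 8 + [1, 0])

print(acc_trivial, f1_trivial)
print(acc_partial, round(f1_partial, 3))
```

In this toy case the always-negative classifier reaches 80% accuracy with an F1 of zero, which is why both metrics are needed to judge a screening model.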

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Patient Health Questionnaire-4 (PHQ-4) items and the respective frequency scores.
  • Figure 2: Overview of the proposed methodology for depression-anxiety detection using selfies and PHQ-4 responses. The training pipeline (top flow) involves filtering invalid selfies, cropping face regions with MTCNN, and labeling images based on PHQ-4 scores. The model (CNN or FFNN) is trained either directly from images (CNN-based) or text descriptions (VLM-based). The test pipeline (bottom flow) uses the trained model to classify new selfies.
  • Figure 3: Zero-shot description generation with VLMs. The VLM prompt consists of an image (cropped face from a selfie) and a text instruction (text prompt). A description is generated for each face image in the image dataset. The label from each source image is transferred to the respective generated description, giving rise to an annotated dataset of textual descriptions.
  • Figure 4: Data distribution after removing invalid samples. Most subjects contributed a single sample, while one contributed nine. The imbalance is evident, with a bias towards negative samples (PHQ-4 $< 6$).
  • Figure 5: Sensitivity analysis. The FFNN classifier (VLM-based) was evaluated with various numbers of hidden units: $h=4, 8, 16, \ldots, 256$. The dashed line in each chart represents the highest value for the respective metric as reported in Table 1.
  • ...and 1 more figure
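The captions for Figures 2 and 4 indicate that each selfie is labeled from the participant's PHQ-4 score, with negative samples defined by PHQ-4 < 6. A minimal sketch of that labeling rule follows. It assumes the standard PHQ-4 scoring (four items, each scored 0 to 3, summed to a total of 0 to 12) and infers that a total of 6 or more is the positive class; the function names are illustrative, not taken from the paper.

```python
from typing import Sequence

# Assumption inferred from the Figure 4 caption: negatives have PHQ-4 < 6,
# so a total of 6 or more is treated as positive here.
PHQ4_THRESHOLD = 6


def phq4_total(item_scores: Sequence[int]) -> int:
    """Sum the four PHQ-4 item scores, each expected to be an integer in 0..3."""
    if len(item_scores) != 4:
        raise ValueError("PHQ-4 has exactly four items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each PHQ-4 item is scored from 0 to 3")
    return sum(item_scores)


def phq4_label(item_scores: Sequence[int]) -> int:
    """Return 1 (positive) if the total reaches the threshold, else 0 (negative)."""
    return 1 if phq4_total(item_scores) >= PHQ4_THRESHOLD else 0


# Hypothetical responses from two participants.
below_threshold = phq4_label([1, 1, 1, 2])  # total 5
at_threshold = phq4_label([2, 2, 1, 1])     # total 6
print(below_threshold, at_threshold)
```

Every selfie from a participant would inherit that participant's label, and in the VLM-based pipeline the same label is transferred to the generated text description, as the Figure 3 caption describes.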