Evaluating the Effectiveness of the Foundational Models for Q&A Classification in Mental Health care

Hassan Alhuzali, Ashwag Alasmari

TL;DR

This study evaluates the effectiveness of foundational models for classification of Questions and Answers (Q&A) in the domain of mental health care and concludes that PLMs and prompt-based approaches hold promise for mental health support in Arabic.

Abstract

Pre-trained Language Models (PLMs) have the potential to transform mental health support by providing accessible and culturally sensitive resources. However, despite this potential, their effectiveness in mental health care, and specifically for the Arabic language, has not been extensively explored. To bridge this gap, this study evaluates the effectiveness of foundational models for classification of Questions and Answers (Q&A) in the domain of mental health care. We leverage the MentalQA dataset, an Arabic collection featuring Q&A interactions related to mental health. We conducted experiments using four different types of learning approaches: traditional feature extraction, PLMs as feature extractors, fine-tuning PLMs, and prompting large language models (GPT-3.5 and GPT-4) in zero-shot and few-shot learning settings. While traditional feature extractors combined with Support Vector Machines (SVM) showed promising performance, PLMs exhibited even better results due to their ability to capture semantic meaning. For example, MARBERT achieved the highest performance with a Jaccard Score of 0.80 for question classification and a Jaccard Score of 0.86 for answer classification. We further conducted an in-depth analysis that examined the effects of fine-tuning versus non-fine-tuning and the impact of varying data size, and included an error analysis. Our analysis demonstrates that fine-tuning was beneficial for enhancing the performance of PLMs, and that the size of the training data played a crucial role in achieving high performance. We also explored prompting, where few-shot learning with GPT-3.5 yielded promising results. There was an improvement of 12% for question classification and 45% for answer classification. Based on our findings, we conclude that PLMs and prompt-based approaches hold promise for mental health support in Arabic.
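
Since the abstract reports results as Jaccard Scores on a multi-label task, a small worked example may help readers interpret numbers such as 0.80 and 0.86. The sketch below assumes a sample-averaged Jaccard score (the mean, over posts, of the intersection size divided by the union size of the gold and predicted label sets). The abstract does not state which averaging the authors used, and the label names "A", "B", and "C" are hypothetical placeholders rather than MentalQA categories.

```python
from typing import Sequence, Set


def jaccard(gold: Set[str], pred: Set[str]) -> float:
    """Jaccard similarity |gold ∩ pred| / |gold ∪ pred| for one sample."""
    union = gold | pred
    if not union:
        # Assumption: two empty label sets are treated as a perfect match.
        return 1.0
    return len(gold & pred) / len(union)


def mean_jaccard(golds: Sequence[Set[str]], preds: Sequence[Set[str]]) -> float:
    """Average the per-sample Jaccard similarity over all samples."""
    if len(golds) != len(preds):
        raise ValueError("golds and preds must have the same length")
    if not golds:
        raise ValueError("at least one sample is required")
    return sum(jaccard(g, p) for g, p in zip(golds, preds)) / len(golds)


# Hypothetical labels: the first post is half right, the second fully right.
golds = [{"A", "B"}, {"C"}]
preds = [{"A"}, {"C"}]
score = mean_jaccard(golds, preds)  # (0.5 + 1.0) / 2 = 0.75
```

Under this definition a partially correct prediction earns partial credit, which is why the metric suits posts that carry more than one category.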

Paper Structure

This paper contains 23 sections, 1 equation, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example of two annotated Q&A posts in MentalQA dataset, with each Q&A post translated into English for better readability. The first row represents the questions, while the second row represents the corresponding answers. Additionally, the categories for each question and answer are included.
  • Figure 2: An overview of our experimental design. The input is passed to each learning approach, and the outputs of the various approaches are mapped to the desired task outcome, namely a multi-label classification of Q/A types.
  • Figure 3: The impact of fine-tuning PLMs compared to not fine-tuning them. The x-axis represents the metrics employed in the paper, while the y-axis represents the corresponding model scores. The colored bars distinguish results with fine-tuning (w/ FT) from results without fine-tuning (w/o FT).
  • Figure 4: Both plots depict the impact of few-shot learning on model performance.
  • Figure 5: Both plots depict the impact of data size on model performance. The x-axis in the plot indicates the number of samples utilized for training, while the y-axis corresponds to the score of each metric.