MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models

Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, Nirmal Punjabi

TL;DR

MHQA introduces a large, PubMed-abstract–based multi-domain mental-health QA dataset with four question types (factoid, diagnostic, prognostic, preventive), complemented by a gold sub-collection (MHQA-Gold) and a pseudo-labeled cohort (MHQA-B). It details a generation and validation pipeline, including post hoc checks and expert annotation, to ground QA pairs in scientific knowledge. Benchmarking across GPT-4o, other LLMs, and discriminative models with and without supervised finetuning demonstrates strong but imperfect performance, with factoid questions proving the hardest and OCD topics presenting notable difficulty. The work establishes a foundation for knowledge-intensive mental-health QA research and points to future enhancements such as retrieval-augmented methods and knowledge graphs to push beyond current capabilities.

Abstract

Mental health remains a challenging problem worldwide, with issues such as depression and anxiety becoming increasingly common. Large Language Models (LLMs) have seen wide application in healthcare, particularly in answering medical questions. However, there is a lack of standard benchmarking datasets for question answering (QA) in mental health. Our work presents a novel multiple-choice dataset, MHQA (Mental Health Question Answering), for benchmarking language models (LMs). Previous mental health datasets have focused primarily on text classification into specific labels or disorders. MHQA, on the other hand, presents question answering for mental health focused on four key domains: anxiety, depression, trauma, and obsessive/compulsive issues, with diverse question types, namely factoid, diagnostic, prognostic, and preventive. We use PubMed abstracts as the primary source for QA. We develop a rigorous pipeline in which an LLM identifies information in abstracts according to various selection criteria and converts it into QA pairs. Valid QA pairs are then extracted based on post-hoc validation criteria. Overall, our MHQA dataset consists of 2,475 expert-verified gold-standard instances, called MHQA-gold, and ~56.1k pairs pseudo-labeled using external medical references. We report F1 scores for different LLMs, along with few-shot and supervised fine-tuning experiments, and discuss the insights behind the scores.
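
To make the evaluation setup concrete, here is a minimal Python sketch of how a four-option multiple-choice item might be represented and scored with F1. The abstract does not say which F1 averaging is used, so macro-averaging over the four answer positions is an assumption here. The names `MCQItem` and `macro_f1` and the field layout are hypothetical and are not taken from the MHQA release.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice QA instance."""
    question: str
    options: List[str]   # four answer choices
    answer: int          # index of the single correct option (0-3)
    domain: str          # e.g. "anxiety", "depression", "trauma", "ocd"
    qtype: str           # e.g. "factoid", "diagnostic", "prognostic", "preventive"


def macro_f1(gold: Sequence[int], pred: Sequence[int], num_classes: int = 4) -> float:
    """Macro-averaged F1 over answer indices (assumed averaging scheme).

    A class that never appears in gold or pred contributes an F1 of 0.0.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Placeholder item: the text is illustrative, not drawn from the dataset.
item = MCQItem(
    question="<question text>",
    options=["<option A>", "<option B>", "<option C>", "<option D>"],
    answer=2,
    domain="anxiety",
    qtype="factoid",
)
```

Given model predictions as option indices, `macro_f1([it.answer for it in items], predictions)` yields a single score. Filtering `items` by `domain` or `qtype` first would give per-domain or per-type scores.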

Paper Structure

This paper contains 31 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: An instance of the MHQA dataset: questions are categorized into four topics, each followed by four answer choices, with one correct option.
  • Figure 2: Overall framework used for dataset generation. (A) Selection of abstracts and conversion into questions. (B) Post-hoc validation process to remove inconsistent questions. (C) Human and pseudo-annotation process for correction of inconsistent options.
  • Figure 3: Distribution of question types in 2K random samples of the MHQA dataset.