MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models

Suraj Racha; Prashant Joshi; Anshika Raman; Nikita Jangid; Mridul Sharma; Ganesh Ramakrishnan; Nirmal Punjabi

TL;DR

MHQA introduces a large, PubMed-abstract–based multi-domain mental-health QA dataset with four question types (factoid, diagnostic, prognostic, preventive), complemented by a gold sub-collection (MHQA-Gold) and a pseudo-labeled cohort (MHQA-B). It details a generation and validation pipeline, including post hoc checks and expert annotation, to ground QA pairs in scientific knowledge. Benchmarking across GPT-4o, other LLMs, and discriminative models with and without supervised finetuning demonstrates strong but imperfect performance, with factoid questions proving the hardest and OCD topics presenting notable difficulty. The work establishes a foundation for knowledge-intensive mental-health QA research and points to future enhancements such as retrieval-augmented methods and knowledge graphs to push beyond current capabilities.

Abstract

Mental health remains a challenging problem worldwide, with conditions such as depression and anxiety becoming increasingly common. Large Language Models (LLMs) have been widely applied in healthcare, particularly for answering medical questions. However, there is a lack of standard benchmarking datasets for question answering (QA) in mental health. Our work presents a novel multiple-choice dataset, MHQA (Mental Health Question Answering), for benchmarking language models (LMs). Previous mental health datasets have focused primarily on classifying text into specific labels or disorders. MHQA, in contrast, poses question-answering tasks across four key domains: anxiety, depression, trauma, and obsessive/compulsive issues, with four question types: factoid, diagnostic, prognostic, and preventive. We use PubMed abstracts as the primary source for QA. We develop a rigorous pipeline in which an LLM identifies information in abstracts according to various selection criteria and converts it into QA pairs. Valid QA pairs are then extracted using post-hoc validation criteria. Overall, the MHQA dataset consists of 2,475 expert-verified gold-standard instances, called MHQA-gold, and ~56.1k pairs pseudo-labeled using external medical references. We report F1 scores for different LLMs, along with few-shot and supervised fine-tuning experiments, and discuss the insights these scores provide.
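To make the reported metric concrete, the following is a minimal sketch of how an F1 score could be computed for multiple-choice predictions. It assumes macro-averaging over option indices and treats each answer as an integer label. The abstract does not state the averaging scheme or the number of options per question, so the function name `macro_f1`, the `num_labels` parameter, and the toy data are illustrative assumptions rather than the authors' evaluation code.

```python
def macro_f1(gold, pred, num_labels):
    """Macro-averaged F1 over answer-option indices 0..num_labels-1.

    gold and pred are equal-length sequences of integer option indices.
    Assumption: macro averaging; the paper may use a different scheme.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    per_label = []
    for label in range(num_labels):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A label that never appears in gold or pred contributes 0.0.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / num_labels


# Toy example with two options (hypothetical data, not from MHQA).
gold_answers = [0, 0, 1, 1]
model_answers = [0, 1, 1, 1]
score = macro_f1(gold_answers, model_answers, num_labels=2)
print(f"macro-F1 = {score:.4f}")
```

The same function could be applied separately to each domain or question type to obtain a per-category breakdown, by filtering the gold and predicted labels before calling it.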
