Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, Colleen Waickman

TL;DR

MENTAT introduces a clinician-annotated dataset to evaluate real-world psychiatric decision-making beyond board-exam-style questions. It spans five domains (diagnosis, treatment, monitoring, triage, documentation) and uses demographic variables to study bias, with annotated ambiguity captured via a preference framework. Annotations are converted into probabilistic preferences with a hierarchical Bradley-Terry model, where the probability that option i is preferred over j for annotator a follows $P(i \succ j \mid a) = \frac{1}{1 + \exp(- (\gamma_a + \alpha_a (\beta_i - \beta_j)))}$, enabling soft labels and calibration. Experiments show strong performance in diagnosis and treatment but limited performance in triage and documentation, and reveal demographic biases, underscoring the need for bias mitigation and safe deployment guidelines; MENTAT thus serves as an open-source, evaluation-focused foundation for improving AI in mental health.
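As an illustration of the preference formula above, the following is a minimal sketch and not the authors' implementation. It only evaluates $P(i \succ j \mid a)$ for given parameters and does not fit the hierarchical model. The function name `preference_probability` and the example values for $\beta_i$, $\beta_j$, $\alpha_a$, and $\gamma_a$ are assumptions chosen for demonstration.

```python
import math


def preference_probability(beta_i, beta_j, alpha_a=1.0, gamma_a=0.0):
    """Evaluate P(i > j | a) = 1 / (1 + exp(-(gamma_a + alpha_a * (beta_i - beta_j)))).

    beta_i, beta_j: latent scores of answer options i and j.
    alpha_a, gamma_a: annotator-specific scale and offset.
    All values passed in below are hypothetical, not taken from the paper.
    """
    z = gamma_a + alpha_a * (beta_i - beta_j)
    # Plain logistic function; adequate for moderate |z| in a sketch like this.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical latent scores for two answer options.
betas = {"A": 1.2, "B": 0.3}

# With gamma_a = 0 the two directions are complementary.
p_ab = preference_probability(betas["A"], betas["B"])
p_ba = preference_probability(betas["B"], betas["A"])

# A positive offset gamma_a shifts this annotator toward the first-listed option.
p_ab_offset = preference_probability(betas["A"], betas["B"], gamma_a=0.5)

print(f"P(A > B) = {p_ab:.3f}, P(B > A) = {p_ba:.3f}")
print(f"P(A > B) with offset = {p_ab_offset:.3f}")
```

In this sketch, equal scores with no offset give a probability of 0.5, and a larger $\alpha_a$ makes an annotator's preferences more sharply dependent on the score difference.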

Abstract

Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset, created without any LM assistance, is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions, each with five answer options, have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and on how consistently free-form responses deviate from human-annotated samples.

Paper Structure

This paper contains 24 sections, 3 equations, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Designed and annotated by mental health clinicians, the MENTAT (MENtal health Tasks AssessmenT) dataset contains 203 base questions and answers on day-to-day mental healthcare decision-making across five categories: diagnosis, documentation, treatment, triage, and monitoring. It offers variations of non-decision-relevant patient demographic information and captures task-specific ambiguity in the uncertainty of expert preferences.
  • Figure 2: (Top) Mean annotation score example with 95% confidence interval aggregated over all annotations for question 127 from the triage category. (Bottom) Resulting preference probabilities calculated via hierarchical Bradley-Terry model to be used as evaluation labels, e.g., to calculate accuracy or cross-entropy loss.
  • Figure 3: Probability that the question creator's original answer is among the top-$k$ answers, ranked by preference probability, when using a regular versus a hierarchical Bradley-Terry model.
  • Figure 4: Using the core dataset of MENTAT ($\mathcal{D}_0$), we evaluate eleven off-the-shelf instruction-tuned and three (mental) healthcare fine-tuned models for their task-specific accuracy. The random baseline is 0.2 due to all questions having five answer options.
  • Figure 5: USMLE board exam question example
  • ...and 17 more figures
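
The Figure 2 caption mentions using preference probabilities as evaluation labels, e.g., to calculate accuracy or cross-entropy loss. The following is a minimal sketch of that idea under stated assumptions, not the paper's evaluation code: the soft labels and model probabilities are invented for a single five-option question, and the helper names `cross_entropy` and `top1_match` are hypothetical.

```python
import math


def cross_entropy(soft_labels, model_probs, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_k p_k * log(q_k) between soft labels p and model output q."""
    return -sum(p * math.log(max(q, eps)) for p, q in zip(soft_labels, model_probs))


def top1_match(soft_labels, model_probs):
    """True if the model's most likely option is also the most preferred option."""
    best_label = max(range(len(soft_labels)), key=lambda k: soft_labels[k])
    best_model = max(range(len(model_probs)), key=lambda k: model_probs[k])
    return best_label == best_model


# Hypothetical preference probabilities over five answer options (sum to 1).
soft_labels = [0.05, 0.10, 0.55, 0.25, 0.05]
# Hypothetical model distribution over the same five options.
model_probs = [0.10, 0.10, 0.50, 0.20, 0.10]
# A uniform guess over five options, matching the 0.2 random baseline for accuracy.
uniform = [0.2] * 5

ce_model = cross_entropy(soft_labels, model_probs)
ce_uniform = cross_entropy(soft_labels, uniform)

print(f"cross-entropy (model):   {ce_model:.4f}")
print(f"cross-entropy (uniform): {ce_uniform:.4f}")
print(f"top-1 match: {top1_match(soft_labels, model_probs)}")
```

Against a uniform guess, the cross-entropy equals $\ln 5$ regardless of the soft labels, which gives a simple reference point for reading the model's score.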