HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare

Aditya Siddhant

TL;DR

This work addresses the challenge of producing calibrated, discrete mood/risk scores from multimodal and longitudinal behavioral health data. It introduces PEARL, a four-year, monthly longitudinal dataset across three patients, and HARBOR, a domain-adapted behavioral health–aware LLM trained to predict the Harbor Risk Score (HRS) on a -3 to +3 scale through mid-training, supervised fine-tuning, reinforcement learning, and Self-Taught Reasoning. Across multiple baselines and ablations, HARBOR achieves substantially higher accuracy (0.69) and stronger association with ground-truth mood trajectories than traditional models and off-the-shelf LLMs, while providing calibrated confidence estimates and interpretability aligned with psychiatric practice. The results support the potential of clinically grounded LLMs as decision-support tools in behavioral healthcare, with emphasis on safety, calibration, and robustness to distributional shifts, and point to future work expanding dataset scope and temporal granularity.

Abstract

Behavioral healthcare risk assessment remains a challenging problem because patient data are highly multimodal and mood and affective disorders evolve over time. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health-aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self-reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off-the-shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.

Paper Structure

This paper contains 30 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: Overview of the Harbor Risk Score (HRS) scale, interpretability design, and calibration concept. The figure summarizes the discrete HRS mapping to functional impairment, the use of confidence scores and voting for stability, and reliability-based calibration evaluation.
  • Figure 2: Prompt used for default evaluation of language models, including HARBOR and proprietary LLM baselines. Unicode minus signs are avoided for LaTeX compatibility.
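
The confidence scores, voting, and reliability-based calibration named in the Figure 1 caption can be illustrated with a minimal Python sketch. The paper's actual procedure is not given in this summary, so everything below is an assumption for illustration: the function names `vote_hrs` and `expected_calibration_error` are hypothetical, the vote is a simple confidence-weighted majority over several sampled predictions, and calibration is measured with a standard equal-width-bin expected calibration error.

```python
from collections import Counter

# HRS range as stated in the abstract: integers from -3 to +3.
HRS_MIN, HRS_MAX = -3, 3


def vote_hrs(samples):
    """Confidence-weighted majority vote (illustrative assumption).

    samples: list of (score, confidence) pairs, one per sampled prediction.
    Returns (winning_score, share_of_total_confidence).
    Ties are broken toward the lower score, an arbitrary but deterministic choice.
    """
    weights = Counter()
    for score, conf in samples:
        if not (isinstance(score, int) and HRS_MIN <= score <= HRS_MAX):
            raise ValueError(f"HRS must be an integer in [-3, 3], got {score!r}")
        weights[score] += conf
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one sample with positive confidence is required")
    best = max(weights, key=lambda s: (weights[s], -s))
    return best, weights[best] / total


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |mean confidence - accuracy| gap per bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

Calling `vote_hrs([(-1, 0.6), (-1, 0.5), (0, 0.9)])` aggregates three sampled predictions into a single score with a vote share, and `expected_calibration_error` can then be run over the resulting confidences and correctness flags of a held-out set; a value near zero indicates that stated confidence tracks observed accuracy.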