HARBOR: Holistic Adaptive Risk assessment model for BehaviORal healthcare
Aditya Siddhant
TL;DR
This work addresses the challenge of producing calibrated, discrete mood and risk scores from multimodal, longitudinal behavioral health data. It introduces PEARL, a dataset of monthly observations collected over four years from three patients, and HARBOR, a domain-adapted, behavioral health–aware LLM trained to predict the Harbor Risk Score (HRS) on an integer scale from -3 to +3. Training combines mid-training, supervised fine-tuning, reinforcement learning, and Self-Taught Reasoning. Across multiple baselines and ablations, HARBOR achieves higher accuracy (0.69, versus 0.54 for logistic regression and 0.29 for the strongest proprietary LLM baseline) and a stronger association with ground-truth mood trajectories than traditional models and off-the-shelf LLMs, while providing calibrated confidence estimates and interpretability aligned with psychiatric practice. The results support the potential of clinically grounded LLMs as decision-support tools in behavioral healthcare, with emphasis on safety, calibration, and robustness to distributional shift. Future work includes expanding the dataset's scope and temporal granularity.
Abstract
Behavioral healthcare risk assessment remains a challenging problem because patient data are highly multimodal and mood and affective disorders evolve over time. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral health–aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients and containing physiological, behavioral, and self-reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off-the-shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.
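To make the scoring setup concrete, the following is a minimal sketch of how predictions on the HRS scale might be validated and scored. It assumes that the reported accuracy is exact-match accuracy over integer labels, which the abstract does not state explicitly; the names `validate_hrs` and `exact_match_accuracy` are hypothetical and are not part of HARBOR or PEARL.

```python
from typing import Sequence

# Scale endpoints as stated in the abstract: -3 (severe depression) to +3 (mania).
HRS_MIN, HRS_MAX = -3, 3


def validate_hrs(score: int) -> int:
    """Return the score unchanged if it is an integer on the HRS scale."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError(f"HRS must be an integer, got {type(score).__name__}")
    if not HRS_MIN <= score <= HRS_MAX:
        raise ValueError(f"HRS must lie in [{HRS_MIN}, {HRS_MAX}], got {score}")
    return score


def exact_match_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions equal to the reference score.

    Assumption: 'accuracy' means exact agreement on the 7-point scale.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one observation is required")
    hits = sum(
        validate_hrs(t) == validate_hrs(p) for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)
```

Under this assumption, a prediction that is off by one point counts as an error just as a prediction that is off by six does. An ordinal metric such as mean absolute error would treat the two differently.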
