HEARTS: Benchmarking LLM Reasoning on Health Time Series

Sirui Li, Shuhan Xiao, Mihir Joshi, Ahmed Metwally, Daniel McDuff, Wei Wang, Yuzhe Yang

TL;DR

HEARTS (Health Reasoning over Time Series) is a unified benchmark for evaluating the hierarchical reasoning capabilities of LLMs over general health time series. It provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Abstract

The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals three main findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Second, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

Paper Structure (56 sections, 18 figures, 7 tables)

Figures (18)

  • Figure 1: Overview of HeaRTS. We present the first diverse benchmark for health time-series reasoning, encompassing 20 signal modalities spanning 16 datasets and 12 health domains, with the broadest coverage to date of sequence length, frequency, and time span. It comprises over 20K test samples across 110 tasks, organized into four reasoning categories. More details are in the appendix (task design details and example prompts).
  • Figure 2: Task and domain distributions in HeaRTS.
  • Figure 3: Regression analysis of intelligence index vs. performance on HeaRTS. Models performing below the naive baseline are excluded as outliers.
  • Figure 4: Performance comparison of state-of-the-art ML methods and LLMs. Density reflects the concentration of tasks at each performance level.
  • Figure 5: Regression analysis of performance gain vs. data complexity. Each point represents the 14-model average Kappa.
  • ...and 13 more figures