ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

Chao Li, Cailiang Liu, Ang Gao, Kexin Deng, Shu Zhang, Langping Xu, Xiaotong Shi, Xionghao Ding, Jian Pei, Xun Jiang

Abstract

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events, yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process perturbed by discrete events through sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based planning and dense indicator dynamics to algorithmic simulation with hard physiological bounds. Each user is paired with 100 evaluation queries spanning five dimensions (Lookup, Trend, Comparison, Anomaly, Explanation), stratified into Easy, Medium, and Hard tiers, with all ground-truth answers programmatically computable from the recorded event-indicator relationships. Evaluating 13 methods spanning LLMs with tools, DB-native agents, and memory-augmented RAG, we find that DB-native agents (48-58%) substantially outperform memory-RAG baselines (30-38%), with the gap concentrated on Comparison and Explanation queries, where multi-hop reasoning and evidence attribution are required.
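
To make the indicator model concrete, here is a minimal sketch of a sigmoid-onset, exponential-decay event kernel added to a mean-reverting baseline process. The function names, the logistic onset centered a few days after the event, the AR(1) baseline, and the decay constant `tau` are illustrative assumptions, not the paper's parameterization; saturation is approximated here by hard clipping to physiological bounds, and the projection constraint is omitted.

```python
import numpy as np

def event_kernel(t, t_onset, amplitude, k=1.0, tau=14.0):
    """Sigmoid-onset, exponential-decay impact of one event at day t.

    Hypothetical parameterization: the paper specifies the kernel family
    but not these exact symbols or constants.
    """
    dt = t - t_onset
    if dt < 0:
        return 0.0
    onset = 1.0 / (1.0 + np.exp(-k * (dt - 3.0)))  # ramp-up over roughly the first week
    decay = np.exp(-dt / tau)                      # effect fades with time constant tau
    return amplitude * onset * decay

def simulate_indicator(days, baseline, sigma, events, lo, hi, rng):
    """Mean-reverting baseline noise plus summed event kernels, clipped to
    hard physiological bounds [lo, hi] (a stand-in for the saturation constraint)."""
    values = np.empty(days)
    x = baseline
    for t in range(days):
        x = baseline + 0.8 * (x - baseline) + rng.normal(0.0, sigma)  # AR(1) baseline
        impact = sum(event_kernel(t, e["day"], e["amplitude"]) for e in events)
        values[t] = np.clip(x + impact, lo, hi)
    return values

# Example: one year of resting heart rate with a stressor at day 30
# and a beneficial event at day 120 (all values illustrative).
rng = np.random.default_rng(0)
events = [{"day": 30, "amplitude": 8.0}, {"day": 120, "amplitude": -5.0}]
rhr = simulate_indicator(365, baseline=62.0, sigma=1.5, events=events,
                         lo=40.0, hi=120.0, rng=rng)
```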

Paper Structure

This paper contains 71 sections, 9 equations, 4 figures, 11 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Structure and evaluation design of the event-driven synthetic longitudinal benchmark.
  • Figure 2: Internal probabilistic modeling architecture. Blue nodes denote LLM-driven conditional sampling; green nodes denote algorithmic simulation governed by the indicator-dynamics, event-kernel, and exam-anchoring equations; orange nodes denote validation mechanisms. Each LLM call conditions on a narrow context (a short history window, a single event type) to estimate a low-dimensional conditional distribution, while the simulator computes dense daily dynamics deterministically. The three-level validation layer covers expert review of event-impact templates, population-level marginal calibration against published norms, and per-indicator conformance auditing.
  • Figure 3: Hybrid generation pipeline (a control-flow sketch follows this list). LLM modules handle sparse semantic decisions (profiles, trajectory plan, event narratives, exam metadata), while algorithmic simulation produces daily device indicators under explicit dynamics and deterministic constraints. (1) Initialization with Profile Generation and Indicator Selection; (2) Trajectory Planning that produces a multi-phase narrative arc; (3) Daily Loop with Event Decision (LLM + trajectory context + sparsity gate), Device Indicator Simulator (algorithmic), and Exam Generation (LLM + deterministic anchoring); (4) Export producing structured artifacts.
  • Figure 4: Four-month trajectory excerpt for one synthetic user showing four device indicators (Daily Stress Score, Resting Heart Rate, Total Sleep Time, Daily Step Count) with nine labeled life events. Shaded regions mark active event periods; blue indicates a beneficial effect on that indicator, red an adverse effect, and gray no effect, so the same event may appear in different colors across panels (e.g., "Indoor VR fitness routines" is blue for Stress/HR but red for Sleep/Steps). The black line is the 7-day rolling mean; the orange dotted line marks each indicator's personalized baseline; the green dashed line marks an exam visit. Only the most prominent events are labeled; additional short-term events (e.g., acute anxiety episodes, OTC medication use) also contribute to the observed fluctuations.
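
The daily loop described in the Figure 3 caption can be sketched as follows. This is a control-flow illustration only: `llm`, `simulate_day`, `anchor_exam`, and `plan` are hypothetical stand-ins for the paper's components, and the event rate and exam interval are made-up parameters that approximate the sparsity gate and periodic exam schedule.

```python
import random

def daily_loop(num_days, plan, profile, llm, simulate_day, anchor_exam,
               event_rate=0.02, exam_interval=180):
    """Hybrid daily loop: LLM for sparse semantic decisions, algorithmic
    simulation for dense indicators, deterministic anchoring for exams.
    All callables here are hypothetical placeholders."""
    events, device_records, exam_records = [], [], []
    for day in range(num_days):
        # Sparsity gate: only occasionally ask the LLM whether an event occurs,
        # conditioned on a narrow context (current phase, recent events).
        if random.random() < event_rate:
            event = llm.propose_event(phase=plan.phase_at(day),
                                      recent=events[-3:], profile=profile)
            if event is not None:
                events.append(event)  # carries per-indicator impact parameters

        # Algorithmic simulation: dense daily indicators under active event kernels.
        device_records.append(simulate_day(day, profile, events))

        # Periodic exams: the LLM writes metadata/narrative, while exam values
        # are anchored deterministically to the simulated state on that day.
        if day > 0 and day % exam_interval == 0:
            report = llm.write_exam_metadata(profile, events)
            exam_records.append(anchor_exam(report, device_records[-1]))

    return events, device_records, exam_records
```

The split mirrors the pipeline's design rationale: the LLM is invoked rarely and on narrow contexts (keeping semantic decisions cheap and auditable), while the simulator runs every day and remains deterministic, so ground-truth answers stay programmatically computable.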