A Multi-Agent Framework for Interpreting Multivariate Physiological Time Series

Davide Gabrielli, Paola Velardi, Stefano Faralli, Bardh Prenkaj

TL;DR

The findings show that the value of agentic AI lies in the selective externalization of computation and structure rather than in maximal reasoning complexity, and highlight concrete design trade-offs and learned lessons, broadly applicable to explainable AI in safety-critical healthcare settings.

Abstract

Continuous physiological monitoring is central to emergency care, yet deploying trustworthy AI is challenging. While LLMs can translate complex physiological signals into clinical narratives, it is unclear how agentic systems perform relative to zero-shot inference. To address this question, we present Vivaldi, a role-structured multi-agent system that explains multivariate physiological time series. Because regulatory constraints preclude live deployment, we instantiate Vivaldi in a controlled clinical pilot with a small, highly qualified cohort of emergency medicine experts. Their evaluations reveal a context-dependent picture that contrasts with the prevailing assumption that agentic reasoning uniformly improves performance. Our experiments show that agentic pipelines substantially benefit non-thinking and medically fine-tuned models, improving expert-rated explanation justification and relevance by +6.9 and +9.7 points, respectively. Conversely, for thinking models, agentic orchestration often degrades explanation quality, including a 14-point drop in relevance, while improving diagnostic precision (ESI F1 +3.6). We also find that explicit tool-based computation is decisive for codifiable clinical metrics, whereas subjective targets, such as pain scores and length of stay, show limited or inconsistent changes. Expert evaluation further indicates that gains in clinical utility depend on visualization conventions, with medically specialized models achieving the most favorable trade-offs between utility and clarity. Together, these findings show that the value of agentic AI lies in the selective externalization of computation and structure rather than in maximal reasoning complexity, and they highlight concrete design trade-offs and lessons learned that are broadly applicable to explainable AI in safety-critical healthcare settings.

Paper Structure (11 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: System architecture as a five-scene Emergency Department narrative. The shaded area on the right represents the Shared Memory Buffer (SMB), which Vivaldi manages. Vivaldi acts as the intermediary for all communication between the agents and the SMB, and between the agents themselves: think of these as API calls in which read (dashed lines) and write (dotted lines) operations happen in the SMB. The circles at the edge of the shaded area mark the agents' entry points for communicating with Vivaldi. For readability, we omit the details of the read/write interactions between the SMB and Vivaldi here and defer them to the individual scenes. Same-colored lines represent a single logical flow whose completion might depend on multiple interactions between Vivaldi and the agents involved (e.g., 5.b, 8, 9, and 10).
  • Figure 2: ESI Level Confusion Matrices. Zero-shot models (left) show a critical failure to detect Level 1 (highest acuity) emergencies, misclassifying all such cases.
  • Figure 3: Performance shifts in Chart Comprehensibility (left) and Clinical Utility (right) across models. Each arrow represents the transition from zero-shot (open circle) to the agentic workflow (solid circle). Green arrows indicate performance gains; red, degradations.
  • Figure 4: Time efficiency analysis of Zero-Shot versus Agentic pipelines. Stacked bars illustrate the temporal distribution of agents, each annotated with its share of total execution time. Gray bars represent baseline Zero-Shot latency.
  • Figure 5: Prompt and Completion token usage by model.
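
The mediation pattern described in the Figure 1 caption, where agents never touch the Shared Memory Buffer directly and every read or write goes through the orchestrator, can be illustrated with a minimal sketch. This is not the authors' implementation: the class names `SharedMemoryBuffer` and `Orchestrator`, the agent names `intake` and `stats`, the key names, and the heart-rate values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class SharedMemoryBuffer:
    """Hypothetical key-value buffer; logs every access for auditability."""
    store: dict[str, Any] = field(default_factory=dict)
    log: list[tuple[str, str, str]] = field(default_factory=list)  # (op, agent, key)

    def read(self, agent: str, key: str) -> Any:
        self.log.append(("read", agent, key))
        return self.store.get(key)

    def write(self, agent: str, key: str, value: Any) -> None:
        self.log.append(("write", agent, key))
        self.store[key] = value


class Orchestrator:
    """Hypothetical intermediary: agents only see the inputs it hands them."""

    def __init__(self) -> None:
        self.smb = SharedMemoryBuffer()

    def run(self, agent: str, fn: Callable[[dict[str, Any]], Any],
            reads: list[str], write_key: str) -> Any:
        # Read phase: fetch the declared inputs on the agent's behalf.
        inputs = {key: self.smb.read(agent, key) for key in reads}
        # The agent computes on plain data, with no buffer access of its own.
        result = fn(inputs)
        # Write phase: store the agent's output under the declared key.
        self.smb.write(agent, write_key, result)
        return result


# Illustrative flow with made-up values (assumed, not from the paper).
orch = Orchestrator()
orch.smb.write("intake", "heart_rate", [80, 90, 100])
mean_hr = orch.run(
    "stats",
    lambda d: sum(d["heart_rate"]) / len(d["heart_rate"]),
    reads=["heart_rate"],
    write_key="mean_hr",
)
```

Routing every access through one component keeps a complete access log, which is one plausible reason to prefer this design when explanations must be traceable; whether Vivaldi logs accesses this way is an assumption.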