A global log for medical AI

Ayush Noori, Adam Rodman, Alan Karthikesalingam, Bilal A. Mateen, Christopher A. Longhurst, Daniel Yang, Dave deBronkart, Gauden Galea, Harold F. Wolf, Jacob Waxman, Joshua C. Mandel, Juliana Rotich, Kenneth D. Mandl, Maryam Mustafa, Melissa Miles, Nigam H. Shah, Peter Lee, Robert Korom, Scott Mahoney, Seth Hain, Tien Yin Wong, Trevor Mundel, Vivek Natarajan, Noa Dagan, David A. Clifton, Ran D. Balicer, Isaac S. Kohane, Marinka Zitnik

TL;DR

The paper addresses the lack of standardized, event-level logging for medical AI deployments and argues this gap hampers safety, accountability, and continuous improvement. It introduces MedLog, a nine-field logging protocol that captures comprehensive context for each AI interaction, enabling real-time surveillance, bias and shift detection, and post-market oversight. The authors outline privacy, data management, deployment pathways, governance, and global adoption considerations, asserting that standardized logging can support digital epidemiology of AI usage and international benchmarking. They connect MedLog to established standards (PROV, OpenTelemetry, FHIR), showcase a Clalit case study, and provide code and prototypes to catalyze community adoption and interoperability.

Abstract

Modern computer systems often rely on syslog, a simple, universal protocol that records every critical event across heterogeneous infrastructure. However, healthcare's rapidly growing clinical AI stack has no equivalent. As hospitals rush to pilot large language models and other AI-based clinical decision support tools, we still lack a standard way to record how, when, by whom, and for whom these AI models are used. Without that transparency and visibility, it is challenging to measure real-world performance and outcomes, detect adverse events, or correct bias or dataset drift. In the spirit of syslog, we introduce MedLog, a protocol for event-level logging of clinical AI. Any time an AI model is invoked to interact with a human, interface with another algorithm, or act independently, a MedLog record is created. This record consists of nine core fields: header, model, user, target, inputs, artifacts, outputs, outcomes, and feedback, providing a structured and consistent record of model activity. To encourage early adoption, especially in low-resource settings, and minimize the data footprint, MedLog supports risk-based sampling, lifecycle-aware retention policies, and write-behind caching; detailed traces for complex, agentic, or multi-stage workflows can also be captured under MedLog. MedLog can catalyze the development of new databases and software to store and analyze MedLog records. Realizing this vision would enable continuous surveillance, auditing, and iterative improvement of medical AI, laying the foundation for a new form of digital epidemiology.
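As a rough illustration of the record described above, the following Python sketch models the nine core fields as a dataclass that is filled in progressively as messages arrive, plus a toy version of risk-based sampling. Only the nine field names come from the abstract. Everything else is an assumption made for illustration and not part of the protocol as published: the class and function names, the sub-keys inside each field, the `sepsis-risk-model` example, and the sampling rates.

```python
from __future__ import annotations

import json
import random
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

# The nine core fields named in the abstract, in the order given there.
CORE_FIELDS = (
    "header", "model", "user", "target", "inputs",
    "artifacts", "outputs", "outcomes", "feedback",
)

# Hypothetical sampling rates per risk tier (not taken from the paper).
SAMPLING_RATES = {"high": 1.0, "medium": 0.5, "low": 0.1}


@dataclass
class MedLogRecord:
    """Illustrative container: one dict per core field (sub-keys are assumed)."""

    header: dict[str, Any] = field(default_factory=dict)
    model: dict[str, Any] = field(default_factory=dict)
    user: dict[str, Any] = field(default_factory=dict)
    target: dict[str, Any] = field(default_factory=dict)
    inputs: dict[str, Any] = field(default_factory=dict)
    artifacts: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    outcomes: dict[str, Any] = field(default_factory=dict)
    feedback: dict[str, Any] = field(default_factory=dict)

    def merge(self, field_name: str, payload: dict[str, Any]) -> None:
        """Fold a later message (e.g. an outcome or feedback) into the record."""
        if field_name not in CORE_FIELDS:
            raise KeyError(f"unknown MedLog field: {field_name}")
        getattr(self, field_name).update(payload)

    def to_json(self) -> str:
        """Serialize the record with the nine fields in their declared order."""
        return json.dumps(asdict(self))


def new_record(model_name: str, model_version: str) -> MedLogRecord:
    """Open a record at invocation time with an ID and a UTC timestamp."""
    record = MedLogRecord()
    record.merge("header", {
        "record_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    record.merge("model", {"name": model_name, "version": model_version})
    return record


def should_log(risk_tier: str, rng: random.Random) -> bool:
    """Risk-based sampling: keep every high-risk event, a fraction of the rest."""
    return rng.random() < SAMPLING_RATES[risk_tier]


if __name__ == "__main__":
    rng = random.Random(42)
    if should_log("high", rng):
        record = new_record("sepsis-risk-model", "1.2.0")  # hypothetical model
        record.merge("outputs", {"risk_score": 0.81})
        record.merge("feedback", {"clinician_agreed": True})  # arrives later
        print(record.to_json())
```

Keeping each field as an open dictionary lets outcomes and feedback be attached well after the model was invoked, which matches the idea of a record built up over time. A real implementation would presumably fix a schema for each field.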

Paper Structure

This paper contains 16 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: (a) Examples of clinical AI interactions that will be logged under the MedLog protocol, as well as the MedLog records they would create. (b) Timeline demonstrating that MedLog records are progressively built from a stream of messages.
  • Figure 2: (a) Example patterns of model invocation and corresponding record creation in a MedLog implementation. De-identified MedLog records can be aggregated across healthcare systems to support downstream applications. (b) MedLog will transform medicine by enabling evaluation, auditing, and improvement of medical AI.
  • Figure 3: Density plots show the distribution of the "Lactate Dehydrogenase Last Value (LDH)" feature during the training period (January 2018), immediately after the test kit change (March 2023), and in subsequent quarterly snapshots through September 2024. The introduction of the new test kit caused a gradual shift in the distribution of LDH values, which was automatically detected by the AI monitoring system.
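
The kind of feature shift described in Figure 3 can be flagged with a simple two-sample statistic computed over logged inputs. The sketch below uses the population stability index (PSI), chosen here only for illustration; the source does not say which method the monitoring system uses. The function name, the bin count, and the 0.25 alert threshold (a common rule of thumb) are assumptions, and all values are synthetic rather than real LDH measurements.

```python
import bisect
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature.

    Bins are defined by quantiles of the reference sample, so each bin holds
    roughly the same share of the reference data.
    """
    ref_sorted = sorted(reference)
    # Interior bin edges at the reference sample's quantiles.
    edges = [ref_sorted[int(len(ref_sorted) * i / n_bins)] for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of edges strictly below v (0 .. n_bins - 1).
            counts[bisect.bisect_left(edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for a lab feature before and after an assay change.
    training = [rng.gauss(200, 30) for _ in range(5000)]
    after_change = [rng.gauss(260, 30) for _ in range(5000)]

    psi = population_stability_index(training, after_change)
    if psi > 0.25:  # assumed alert threshold
        print(f"possible distribution shift, PSI = {psi:.2f}")
```

Running such a check on each periodic snapshot of logged inputs would produce a drift signal over time, in the spirit of the quarterly snapshots shown in the figure.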