Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

Rafi Al Attrach, Pedro Moreira, Rajna Fani, Renato Umeton, Amelia Fiske, Leo Anthony Celi

TL;DR

The paper addresses the barrier to leveraging large-scale clinical data by introducing M3, a Model Context Protocol-based system that enables natural language querying of MIMIC-IV with auditable SQL. It implements a dual-backend architecture (local SQLite and Google BigQuery) and security safeguards (OAuth2, query validation, audit logging) to provide privacy-preserving access. The approach is evaluated on 100 EHRSQL-2024 queries using Claude Sonnet 4 and gpt-oss-20B, achieving 94% and 93% accuracy respectively, demonstrating robust NL-to-SQL translation without fine-tuning. The work shows that secure, interpretable, and accessible clinical data analysis is feasible on local hardware and outlines a pragmatic roadmap to broaden dataset coverage, expand MCP tooling, and foster community contributions with governance and safety at the forefront.

Abstract

Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using one hundred questions from the EHRSQL 2024 benchmark with two language models: the proprietary Claude Sonnet 4 achieved 94% accuracy, while the open-source gpt-oss-20B (deployable locally on consumer hardware) achieved 93% accuracy. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-source model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis while maintaining security through OAuth2 authentication, query validation, and comprehensive audit logging.
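The query validation and audit logging steps mentioned in the abstract can be illustrated with a minimal sketch. This is not M3's implementation: the function names (`is_safe_select`, `run_audited_query`), the keyword blocklist, the in-memory log, and the toy `admissions` table are all assumptions made for illustration. It shows only the general pattern of accepting a single read-only statement, recording every attempt, and returning the rows together with the SQL that produced them.

```python
import re
import sqlite3
from datetime import datetime, timezone

# Hypothetical in-memory audit log; a real deployment would use durable storage.
AUDIT_LOG = []

# Crude blocklist of write/DDL keywords (an assumption, not M3's actual rule set).
# It can reject harmless queries that mention these words inside string literals.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)


def is_safe_select(sql):
    """Return True only for a single statement that starts with SELECT or WITH."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:  # more than one statement
        return False
    if not re.match(r"(?i)^(select|with)\b", stripped):
        return False
    return FORBIDDEN.search(stripped) is None


def run_audited_query(conn, sql):
    """Validate the query, log the attempt, and return rows plus the SQL itself."""
    allowed = is_safe_select(sql)
    AUDIT_LOG.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "sql": sql,
            "allowed": allowed,
        }
    )
    if not allowed:
        raise ValueError("query rejected: only single read-only statements are allowed")
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    # Returning the SQL alongside the rows lets the user verify what was run.
    return {"sql": sql, "columns": columns, "rows": cursor.fetchall()}


def demo():
    """Run one validated query against a toy table standing in for MIMIC-IV."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE admissions (subject_id INTEGER, hadm_id INTEGER)")
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?)", [(1, 10), (1, 11), (2, 20)]
    )
    return run_audited_query(conn, "SELECT COUNT(*) AS n FROM admissions")


if __name__ == "__main__":
    print(demo())
```

In this sketch the attempt is logged before the allow/deny decision is enforced, so rejected queries also leave a trace. A keyword blocklist is a deliberately simple stand-in; a stricter design might parse the statement or open the database in read-only mode.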

Paper Structure

This paper contains 36 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Results of a complex query, described in natural language as "Among patients who were diagnosed with anemia, unspecified since 2100, what are the top three most commonly prescribed medications that followed during the same hospital visit for patients in their 60 or above?"
  • Figure 2: Conceptual Diagram of the M3 System Architecture
  • Figure 3: Query: “Show trends in systolic blood pressure for patients on vasopressors within 48 hours of ICU admission.”
  • Figure 4: Query: “Among sepsis patients, what’s the source‑of‑infection distribution and how do groups differ in ICU stay and mortality?”