Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen

Abstract

Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.

Paper Structure

This paper contains 10 sections, 3 equations, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Visual abstract of the Clinical Contextual Question Answering (CCQA) pipeline. A clinician submits a natural language question about a patient record. A long-context large language model processes the complete electronic health record and generates a context-aware answer, which is returned to the clinician for use in clinical practice.
  • Figure 2: Accuracy as a function of EHR length for the best-performing model from each model family. Results are shown across quartiles of EHR length (Q1–Q4), each containing 25% of the dataset. The x-axis represents the median EHR length (tokens) within each quartile, and the y-axis shows average accuracy. Accuracy values reported next to each model correspond to performance in the longest-length quartile (Q4).
  • Figure 3: GPU memory usage as a function of input token length for bfloat16 precision. The dashed horizontal line indicates the 160 GB memory budget used in the main experiments.
  • Figure 4: GPU memory usage as a function of input token length for 8-bit quantization. The dashed horizontal line indicates the 160 GB memory budget used in the main experiments. Llama-4 and GPT-OSS models are not included because these architectures are not compatible with bitsandbytes quantization.
  • Figure 5: GPU memory usage as a function of input token length for 4-bit quantization. The dashed horizontal line indicates the 160 GB memory budget used in the main experiments. Llama-4 and GPT-OSS models are not included because these architectures are not compatible with bitsandbytes quantization.
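The memory savings behind the 4-bit and 8-bit results can be approximated with simple back-of-envelope arithmetic: weight memory scales linearly with bits per parameter. The sketch below is illustrative only, assuming weight storage alone for a hypothetical 70B-parameter model; it ignores KV cache, activations, and quantization metadata, which is why measured usage (as in Figures 3–5) grows with input token length.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-only memory estimate in GB.

    Ignores KV cache, activation memory, and quantization metadata
    overhead, so real GPU usage will be higher and will grow with
    context length.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Illustrative figures for a 70B-parameter model (assumed sizes, not
# measurements from the paper): bfloat16, 8-bit, and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(70, bits):.0f} GB")
```

Under these assumptions, bfloat16 weights alone (~140 GB) already approach the 160 GB budget shown in the figures, while 8-bit (~70 GB) and 4-bit (~35 GB) leave substantial headroom for the long EHR contexts the framework processes.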