
ChatSchema: A pipeline of extracting structured information with Large Multimodal Models based on schema

Fei Wang, Yuewen Zheng, Qin Li, Jingyi Wu, Pengfei Li, Luxia Zhang

TL;DR

ChatSchema is a two-stage pipeline that combines Large Multimodal Models (LMMs) with OCR to extract structured information from medical reports, guided by a predefined schema. The first stage classifies the report scenario; the second performs schema-aligned information extraction, using prompt engineering and privacy-preserving desensitization, enabling direct data entry. On 100 reports from Peking University First Hospital, key extraction reaches key-precision 98.6%, key-recall 98.5%, and key-F1 98.6%, and value extraction reaches an overall accuracy of 97.2% with precision, recall, and F1 all at 95.8%. Compared to the Baseline, overall accuracy increases by 26.9% and overall F1-score by 27.4%. The results indicate that schema-driven extraction works across the two LMMs tested (GPT-4o and Gemini 1.5 Pro), while OCR errors and key-schema alignment remain areas for improvement, along with expansion to more diverse datasets and languages.

Abstract

Objective: This study introduces ChatSchema, a schema-based method for extracting and structuring information from unstructured data in paper medical reports using a combination of Large Multimodal Models (LMMs) and Optical Character Recognition (OCR). By integrating a predefined schema, we intend to enable LMMs to directly extract and standardize information according to the schema specifications, facilitating further data entry.

Method: Our approach is a two-stage process: a classification stage that categorizes report scenarios, followed by an extraction stage that structures the information. We established and annotated a dataset to verify the effectiveness of ChatSchema, and evaluated key extraction using precision, recall, F1-score, and accuracy. Building on key extraction, we further assessed value extraction. We conducted ablation studies on two LMMs to show how structured information extraction improves with different input modalities and methods.

Results: We analyzed 100 medical reports from Peking University First Hospital and established a ground truth dataset with 2,945 key-value pairs. We evaluated ChatSchema using GPT-4o and Gemini 1.5 Pro and found that GPT-4o performed better overall. For key extraction, key-precision was 98.6%, key-recall was 98.5%, and key-F1-score was 98.6%. For value extraction based on correctly extracted keys, the overall accuracy was 97.2%, and precision, recall, and F1-score were each 95.8%. An ablation study demonstrated that ChatSchema achieved significantly higher overall accuracy and overall F1-score for key-value extraction than the Baseline, with increases of 26.9% in overall accuracy and 27.4% in overall F1-score.
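To make the two-level evaluation concrete, the following is a minimal sketch of how key-level precision, recall, and F1, plus value accuracy restricted to correctly extracted keys, could be computed for a single report. It is one plausible reading of the metrics named in the abstract, not the paper's exact definitions: the function names, the exact-match comparison of values, and the sample field names are all assumptions made for illustration.

```python
def key_metrics(predicted: dict, truth: dict) -> dict:
    """Key-level precision, recall, and F1 (assumed set-based definition)."""
    pred_keys = set(predicted)
    true_keys = set(truth)
    tp = len(pred_keys & true_keys)  # keys that were extracted and are in the ground truth
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(true_keys) if true_keys else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def value_accuracy(predicted: dict, truth: dict) -> float:
    """Share of correctly extracted keys whose value matches exactly (assumption)."""
    shared = set(predicted) & set(truth)
    if not shared:
        return 0.0
    correct = sum(1 for key in shared if predicted[key] == truth[key])
    return correct / len(shared)


# Hypothetical report: field names and values are invented for illustration.
truth = {"patient_age": "54", "hemoglobin": "132", "platelets": "210", "wbc": "6.1"}
predicted = {"patient_age": "54", "hemoglobin": "182", "platelets": "210", "glucose": "5.0"}

keys = key_metrics(predicted, truth)
values = value_accuracy(predicted, truth)
print(keys, values)
```

In this toy report, three of the four predicted keys are in the ground truth, and one of those three carries a wrong value (the kind of digit confusion an OCR error might produce), so the value score is computed over three keys rather than four.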

Paper Structure

This paper contains 11 sections, 7 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Overview of ChatSchema.
  • Figure 2: Details of report scenarios classification stage.
  • Figure 3: Detail of image preprocessing flow.
  • Figure 4: Details of report information extraction stage.
  • Figure 5: Comparison of Baseline and ChatSchema with three different inputs on GPT-4o.
  • ...and 3 more figures