PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

Xavier Tannier, Salam Abbara, Rémi Flicoteaux, Youness Khalil, Aurélie Névéol, Pierre Zweigenbaum, Emmanuel Bacry

Abstract

The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: High-level structure of the dataset (JSON format)
  • Figure 2: Patient count per specialty. Additional documents, outside the general-purpose distribution, were written for four use cases. A subset of the dataset is temporarily embargoed to enable future evaluations under controlled conditions, limiting the risk of large language model contamination through prior exposure to the data. In total, we release 4254 patients (6185 documents) and keep 755 patients (1209 documents) under embargo for future release.
  • Figure 3: PARHAF population pyramid chart.
  • Figure 4: Document and word count by document type. The number of documents to write for each patient was set by the initial scenario. Word counts are computed with the edsnlp library (v0.20).
  • Figure S1: Change in distribution for each specialty, after the adjustment described in the section on the core distribution.
  • ...and 1 more figure
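
The release and embargo counts in the Figure 2 caption can be checked against the corpus totals given in the abstract. The short Python sketch below does only that arithmetic; the dictionary and function names are illustrative and are not part of the PARHAF release.

```python
# Counts quoted in the abstract and in the Figure 2 caption.
# All names below are hypothetical, chosen for this illustration only.
TOTAL = {"patients": 5009, "documents": 7394}
RELEASED = {"patients": 4254, "documents": 6185}
EMBARGOED = {"patients": 755, "documents": 1209}


def split_is_consistent() -> bool:
    """True if released + embargoed equals the corpus total for every unit."""
    return all(RELEASED[unit] + EMBARGOED[unit] == TOTAL[unit] for unit in TOTAL)


def embargo_share(unit: str) -> float:
    """Fraction of the corpus (by patients or by documents) kept under embargo."""
    return EMBARGOED[unit] / TOTAL[unit]


if __name__ == "__main__":
    print("split consistent:", split_is_consistent())
    for unit in TOTAL:
        print(f"{unit}: {embargo_share(unit):.1%} under embargo")
```

The two splits add up to the reported totals, and the embargoed portion comes to roughly 15% of patients and 16% of documents.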