Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Yu-Cheng Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila
TL;DR
This work tackles two high-impact clinical NLP tasks that remain underexplored because of data scarcity and sensitivity: structured nursing observation extraction from nurse dictations and medical order extraction from doctor-patient conversations. It evaluates both open- and closed-weight LLMs on proprietary hospital data and newly released open datasets, and introduces an agentic data-generation pipeline that produces realistic, non-sensitive nurse dictations for the SYNUR dataset. It also releases SIMORD, the first open-source dataset for medical order extraction. The authors demonstrate that LLMs can reduce documentation burden, analyze the relative strengths of different model families, and highlight practical challenges such as long-context handling and output parsing. SYNUR and SIMORD give the community open resources for further research, and the study outlines a path toward scalable, LLM-driven clinical documentation, with future work including larger datasets and constrained-output improvements.
Abstract
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong performance on clinical natural language processing (NLP) tasks across multiple medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular reporting from nurse dictations and medical order extraction from doctor-patient consultations - remain underexplored due to data scarcity and sensitivity, despite active industry efforts. Practical solutions to these real-world clinical tasks can significantly reduce the documentation burden on healthcare providers, allowing greater focus on patient care. In this paper, we investigate these two challenging tasks using private and open-source clinical datasets, evaluating the performance of both open- and closed-weight LLMs, and analyzing their respective strengths and limitations. Furthermore, we propose an agentic pipeline for generating realistic, non-sensitive nurse dictations, enabling structured extraction of clinical observations. To support further research in both areas, we release SYNUR and SIMORD, the first open-source datasets for nurse observation extraction and medical order extraction.
