ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation

Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen

TL;DR

ACI-BENCH introduces the largest public corpus for model-assisted visit-note generation from doctor–patient dialogue, addressing the need for open benchmarks in clinical NLP. It implements three realistic data-generation modes and a SOAP-aligned four-division note structure, paired with thorough content validation and ASR variation analysis. The paper benchmarks a wide spectrum of models, including retrieval baselines, BART/LED variants, BioBART, and OpenAI APIs, revealing that division-based generation often outperforms full-note generation and that GPT-4 and other strong LLMs achieve competitive MEDCON and ROUGE scores. The dataset enables reproducible benchmarking, cross-model comparisons, and development of evaluation metrics tailored to clinical note generation, with practical implications for advancing ambient clinical intelligence while highlighting current limitations. Overall, ACI-BENCH provides a rigorous, publicly available benchmark to drive methodological progress in AI-assisted clinical documentation.

Abstract

Recent breakthroughs in generative models such as GPT-4 have prompted a re-imagining of how widely these models can be used across applications. One area that can benefit from improvements in artificial intelligence (AI) is healthcare. Generating notes from doctor-patient encounters, along with the associated electronic medical record documentation, is one of the most arduous and time-consuming tasks for physicians. It is also a natural beneficiary of advances in generative models. With such advances, however, benchmarking is more critical than ever. Whether studying model weaknesses or developing new evaluation metrics, shared open datasets are essential to understanding the current state of the art. Unfortunately, because clinic encounter conversations are not routinely recorded and are difficult to share ethically due to patient confidentiality, there are no sufficiently large clinic dialogue-note datasets to benchmark this task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performance of several common state-of-the-art approaches.

Paper Structure (27 sections, 2 figures, 25 tables)

This paper contains 27 sections, 2 figures, 25 tables.

Figures (2)

  • Figure 1: Note division example. The same content in a clinical note can appear under different sections. For example, content under "past medical history" in the left note is written in the "history" portion of the note on the right. To separate the full-note target into smaller texts and to minimize data sparsity problems that arise when modeling individual sections, notes are partitioned into separate subjective, objective_exam, objective_results, and assessment_and_plan continuous divisions. This also allows evaluation and generation at a finer granularity than the full-note level.
  • Figure 2: BERT subtoken lengths of concatenated gold/system summaries (test1 Text-davinci-003 system) for the doctor-patient dialogue to clinical note generation task. Because embedding-based models must encode the concatenated reference and hypothesis, it would be difficult to evaluate this corpus fairly using current pretrained BERT models, which have a 512-subtoken limit.
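
The note partitioning described in the Figure 1 caption can be illustrated with a minimal sketch. The four division names come from the caption; everything else is an assumption made for illustration, including the header-to-division mapping `HEADER_TO_DIVISION`, the function name `split_note`, and the premise that section headers sit on their own lines. This is not the authors' procedure, and real notes may need a richer mapping.

```python
from typing import Dict, List

# Division names as listed in the Figure 1 caption.
DIVISIONS = ("subjective", "objective_exam", "objective_results", "assessment_and_plan")

# Hypothetical mapping from section headers to divisions (illustration only).
HEADER_TO_DIVISION = {
    "CHIEF COMPLAINT": "subjective",
    "HISTORY OF PRESENT ILLNESS": "subjective",
    "PAST MEDICAL HISTORY": "subjective",
    "PHYSICAL EXAM": "objective_exam",
    "RESULTS": "objective_results",
    "ASSESSMENT AND PLAN": "assessment_and_plan",
}


def split_note(note: str) -> Dict[str, str]:
    """Group the lines of a note into the four divisions by section header."""
    collected: Dict[str, List[str]] = {name: [] for name in DIVISIONS}
    current = None
    for line in note.splitlines():
        header = line.strip().upper().rstrip(":")
        if header in HEADER_TO_DIVISION:
            # A recognized header switches the active division.
            current = HEADER_TO_DIVISION[header]
        if current is not None:
            # Lines before the first recognized header are dropped.
            collected[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in collected.items()}


if __name__ == "__main__":
    example = (
        "CHIEF COMPLAINT\nKnee pain.\n"
        "PAST MEDICAL HISTORY\nHypertension.\n"
        "PHYSICAL EXAM\nTenderness over the medial knee.\n"
        "RESULTS\nX-ray shows no fracture.\n"
        "ASSESSMENT AND PLAN\nRest and ice."
    )
    for name, text in split_note(example).items():
        print(f"[{name}]\n{text}\n")
```

Grouping several section headers under one division is what lets two notes that file the same content under different headers share a single, larger target text per division.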