ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen
TL;DR
ACI-BENCH introduces the largest public corpus for model-assisted visit-note generation from doctor–patient dialogue, addressing the need for open benchmarks in clinical NLP. It covers three realistic data-generation modes and a SOAP-aligned four-division note structure, paired with content validation and an analysis of ASR variation. The paper benchmarks a wide spectrum of models, including retrieval baselines, BART/LED variants, BioBART, and OpenAI APIs, and finds that division-based generation often outperforms full-note generation and that GPT-4 and other strong LLMs achieve competitive MEDCON and ROUGE scores. The dataset enables reproducible benchmarking, cross-model comparisons, and development of evaluation metrics tailored to clinical note generation, with practical implications for advancing ambient clinical intelligence while highlighting current limitations. Overall, ACI-BENCH provides a publicly available benchmark to drive methodological progress in AI-assisted clinical documentation.
Abstract
Recent breakthroughs in generative models such as GPT-4 have prompted a re-imagining of how widely these models can be used across applications. One area that can benefit from improvements in artificial intelligence (AI) is healthcare. Generating notes from doctor-patient encounters, together with the associated electronic medical record documentation, is one of the most arduous and time-consuming tasks for physicians. It is also a natural beneficiary of advances in generative models. However, with such advances, benchmarking is more critical than ever. Whether studying model weaknesses or developing new evaluation metrics, shared open datasets are an essential part of understanding the current state of the art. Unfortunately, because clinic encounter conversations are not routinely recorded and are difficult to share ethically due to patient confidentiality, there are no sufficiently large clinic dialogue-note datasets to benchmark this task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performance of several common state-of-the-art approaches.
