Chain-of-Thought (CoT) prompting strategies for medical error detection and correction
Zhaolong Wu, Abul Hasan, Jinge Wu, Yunsoo Kim, Jason P. Y. Cheung, Teng Zhang, Honghan Wu
TL;DR
This work tackles automatic medical error detection and correction in clinical notes using GPT-4 with two CoT-based In-Context Learning (ICL) strategies. ICL-RAG-CoT guides error detection with three CoT prompts and retrieval-augmented prompts; ICL-RAG-Reason generates explicit reasons for a note's correctness or incorrectness using a similar retrieval setup. An ensemble combines the two methods to address all three MEDIQA-CORR 2024 sub-tasks, achieving competitive rankings (3rd on sub-tasks 1 and 2, and 7th on sub-task 3, the natural language generation task). Across datasets derived from the MS and UW sources, the methods show robust error detection and span identification, and the ensemble improves natural language generation by drawing on the two independent reasoning paths. The results highlight the practicality of CoT reasoning and RAG-based prompting for clinical text error correction, and point to future work on integrating biomedical knowledge bases and extending the approach to open-source LLMs.
Abstract
This paper describes our submission to the MEDIQA-CORR 2024 shared task on automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of the training and validation datasets and derive three CoT prompts from the error types observed in the clinical notes. In the second method, we use the training dataset to prompt the LLM to deduce reasons for each clinical note's correctness or incorrectness. The constructed CoTs and reasons are then combined with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Our ensemble method ranks 3rd in both sub-tasks 1 and 2 and 7th in sub-task 3 among all submissions.
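To make the idea of augmenting ICL examples with CoTs and reasons concrete, the following is a minimal sketch of how such a few-shot prompt might be assembled. It is not the authors' implementation: the `Example` schema, the `build_prompt` helper, the field labels, and the prompt wording are all assumptions made for illustration, and the retrieval step that selects the examples is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One in-context example (hypothetical schema, not the shared task's data format)."""
    note: str
    has_error: bool
    reason: str                       # reason for correctness/incorrectness, obtained beforehand
    correction: Optional[str] = None  # corrected sentence, only meaningful when has_error is True


def build_prompt(cot: str, examples: List[Example], query_note: str) -> str:
    """Assemble a few-shot prompt: CoT instruction, worked examples, then the query note.

    The layout and labels are assumptions; the paper's actual prompts may differ.
    """
    parts = [cot.strip(), ""]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Clinical note: {ex.note}")
        parts.append(f"Reason: {ex.reason}")
        parts.append(f"Error present: {'yes' if ex.has_error else 'no'}")
        # Only examples that contain an error carry a corrected sentence.
        if ex.has_error and ex.correction:
            parts.append(f"Corrected sentence: {ex.correction}")
        parts.append("")
    parts.append("Now assess the following note.")
    parts.append(f"Clinical note: {query_note}")
    # Ending on "Reason:" nudges the model to reason before giving its verdict.
    parts.append("Reason:")
    return "\n".join(parts)
```

Calling `build_prompt(cot, examples, note)` returns a single string that could be sent to an LLM as the user message. In this sketch the reason is placed before the verdict in each example so that the model sees reasoning precede the answer, which is one common way to elicit chain-of-thought behaviour.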
