Chain-of-Thought (CoT) prompting strategies for medical error detection and correction

Zhaolong Wu, Abul Hasan, Jinge Wu, Yunsoo Kim, Jason P. Y. Cheung, Teng Zhang, Honghan Wu

TL;DR

This work tackles automatic medical error detection and correction in clinical notes using GPT-4 with two CoT-based In-Context Learning (ICL) strategies. ICL-RAG-CoT guides error detection with three CoT prompts and retrieval-augmented prompts; ICL-RAG-Reason has the model generate explicit reasons for why a note is correct or incorrect, using a similar retrieval setup. A rule-based ensemble combines the two methods across all three MEDIQA-CORR 2024 sub-tasks, ranking 3rd on sub-tasks 1 and 2 and 7th on sub-task 3. Across datasets derived from the MS and UW sources, the methods show robust error detection and span identification, with the ensemble enhancing natural language generation by leveraging independent reasoning paths. The results highlight the practicality of CoT reasoning and RAG-based prompting for clinical text error correction and point to future work integrating biomedical knowledge bases and extending to open-source LLMs.

Abstract

This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of the training and validation datasets and infer three CoT prompts by examining the error types in the clinical notes. In the second method, we use the training dataset to prompt the LLM to deduce reasons for why each clinical note is correct or incorrect. The constructed CoTs and reasons are then augmented with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Across the three sub-tasks, our ensemble method ranks 3rd on both sub-tasks 1 and 2 and 7th on sub-task 3 among all submissions.
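
To make the few-shot setup described in the abstract more concrete, the following is a minimal sketch of how retrieved examples and CoT hints might be assembled into one prompt for sub-task 1. It is not the authors' implementation: the `Example` type, the token-overlap (Jaccard) retrieval, the wording of `COT_HINTS`, and the prompt layout are all assumptions made for illustration. Only the three error categories (intervention, diagnostic, management) come from the paper's Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical labelled training note (1 = contains an error, 0 = correct)."""
    note: str
    label: int


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever retriever is actually used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(query: str, pool: list[Example], k: int = 2) -> list[Example]:
    """Return the k pool examples most similar to the query note."""
    return sorted(pool, key=lambda ex: jaccard(query, ex.note), reverse=True)[:k]


# Illustrative hint wording only; the categories follow the Figure 1 caption.
COT_HINTS = [
    "Check whether the intervention is appropriate for the findings.",
    "Check whether the diagnosis is consistent with the findings.",
    "Check whether the management plan fits the diagnosis.",
]


def build_prompt(query: str, pool: list[Example], k: int = 2) -> str:
    """Assemble instruction, CoT hints, retrieved few-shot examples, and the query."""
    lines = ["Decide whether the clinical note contains a medical error (1) or not (0)."]
    lines += [f"Hint {i}: {hint}" for i, hint in enumerate(COT_HINTS, start=1)]
    for ex in retrieve(query, pool, k):
        lines.append(f"Note: {ex.note}")
        lines.append(f"Answer: {ex.label}")
    lines.append(f"Note: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


pool = [
    Example("patient has fever and cough treated with antibiotics", 0),
    Example("fracture of left femur managed with cast", 1),
    Example("patient has fever and rash", 0),
]
query = "patient has fever and cough"
prompt = build_prompt(query, pool, k=2)
```

The resulting `prompt` string would then be sent to the LLM. A real system would likely replace the token-overlap retriever with a stronger similarity measure, but the overall shape (hints, retrieved examples, then the query note) stays the same.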

Paper Structure (15 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Three types of Chain-of-Thought (CoT) prompts utilised in the ICL-RAG-CoT method: (1), (2), and (3) direct the GPT-4 model to focus on intervention, diagnostic, and management errors, respectively.
  • Figure 2: Reason generation template utilised in the ICL-RAG-Reason method.
  • Figure 3: Comparison of few-shot examples with or without CoT using ICL-RAG-CoT method on the Binary Classification Task (i.e. sub-task 1) on the MS Validation Set
  • Figure 4: A template used in ICL-RAG-CoT for the few-shot prompting to solve sub-task 1 and 2.
  • Figure 5: A template used in ICL-RAG-Reason for the few-shot prompting to solve all sub-tasks simultaneously.
  • ...and 1 more figures