Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis

Ameer Hamza Shakur, Michael J. Holcomb, David Hein, Shinyoung Kang, Thomas O. Dalton, Krystle K. Campbell, Daniel J. Scott, Andrew R. Jamieson

TL;DR

This study examines the potential of Large Language Models to assess medical students' communication skills, presents a failure analysis identifying conditions where LLM grading may be less reliable, and recommends best practices for deploying LLMs in medical education settings.

Abstract

Grading Objective Structured Clinical Examinations (OSCEs) is a time-consuming and expensive process that traditionally requires extensive manual effort from human experts. In this study, we explore the potential of Large Language Models (LLMs) to assess skills related to medical student communication. We analyzed 2,027 video-recorded OSCE examinations from the University of Texas Southwestern Medical Center (UTSW), spanning four years (2019-2022) and several different medical cases, or "stations." Specifically, we focused on evaluating students' ability to summarize a patient's medical history: we targeted the rubric item 'did the student summarize the patients' medical history?' from the communication skills rubric. After transcribing the speech audio from the OSCE videos using Whisper-v3, we studied the performance of various LLM-based approaches for grading students on this summarization task based on their examination transcripts. Using several frontier-level open-source and proprietary LLMs, we evaluated techniques such as zero-shot chain-of-thought prompting, retrieval-augmented generation, and multi-model ensemble methods. Our results show that frontier LLMs such as GPT-4 aligned closely with human graders, reaching a Cohen's kappa agreement of 0.88 and indicating strong potential for LLM-based OSCE grading to augment the current grading process. Open-source models also showed promising results, suggesting potential for widespread, cost-effective deployment. Further, we present a failure analysis identifying conditions where LLM grading may be less reliable in this context, and we recommend best practices for deploying LLMs in medical education settings.
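
To make the agreement metric concrete, the following is a minimal sketch of unweighted Cohen's kappa between a human grader and an LLM grader. The function name `cohens_kappa` and the toy grade lists are illustrative assumptions, not data or code from the study; the 0/1/2 labels follow the scoring scale described for the rubric item (2 full credit, 1 partial credit, 0 no credit). The abstract does not say whether the reported 0.88 uses an unweighted or weighted variant, so this should be read as one plausible formulation only.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must grade the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each assigned labels independently at their own marginal rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical grades for eight transcripts (assumed values, not study data).
human_grades = [2, 2, 1, 0, 2, 1, 2, 0]
llm_grades = [2, 2, 1, 0, 2, 2, 2, 0]

kappa = cohens_kappa(human_grades, llm_grades)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most students receive the same score: in the toy data the raters agree on 7 of 8 transcripts, yet kappa is noticeably lower than 0.875 because both raters assign full credit most of the time.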

Paper Structure

This paper contains 22 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: No. of COSCE exams and proportion of students that received full credit on 'summary of medical history' for each year
  • Figure 2: Distribution of scores on 'summary of medical history' item of the communication skills rubric, with 2 being full credit, 1 partial credit, and 0 no credit.
  • Figure 3: Schematic showing the current human-expert grading process for OSCEs. Cameras in the examination room record the OSCE encounters, which are then reviewed by two trained human experts who grade the student examination.
  • Figure 4: Schematic representation of our zero-shot grading workflow for OSCE assessments. First, Whisper-v3 speech recognition is used to transcribe the recorded OSCE encounters. Then the transcript is analyzed by an LLM to assess student performance.
  • Figure 5: Schematic of retrieval-augmented grading workflow for OSCE exams, combining LLM capabilities with embedding-based information retrieval.
  • ...and 4 more figures