A Course Shared Task on Evaluating LLM Output for Clinical Questions

Yufang Hou, Thy Thy Tran, Doan Nam Long Vu, Yiwen Cao, Kai Li, Lukas Rohde, Iryna Gurevych

TL;DR

This work introduces a course-based shared task on evaluating whether LLM answers to health-related clinical questions are harmful, bridging NLP education with practical evaluation. The authors design a four-sub-task framework around harmfulness detection and fine-grained sentence categorization, grounded in Cochrane Clinical Answers (CCA), and assess multiple LLMs (including Llama-2-70b-chat with two prompts, ChatGPT, BingChat, and PerplexityAI) on dev/test splits. A key contribution is a large annotated dataset (1,800 labeled answers for 360 CCAs) created with Label Studio, plus an open-source dataset donation (850 annotations for 130 CCAs) to support future teaching and research. The paper offers educators actionable insights on task design, annotation workload, and evaluation strategies, including a leaderboard-based grading scheme with the formula $c = \frac{20}{n} (n+1 - k)$ and measures to mitigate leakage, providing a practical blueprint for responsible AI education and reproducible NLP assessment.
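
The grading formula can be made concrete with a small sketch. This summary does not define the symbols, so the reading used here is an assumption: $n$ is taken to be the number of ranked teams, $k$ a team's 1-based leaderboard rank, and $c$ the resulting score. The function name `leaderboard_credit` is hypothetical and is not taken from the paper.

```python
def leaderboard_credit(n: int, k: int) -> float:
    """Evaluate c = (20 / n) * (n + 1 - k).

    Assumption (not stated in this summary): n is the number of ranked
    teams and k is a team's 1-based leaderboard rank.
    """
    if n < 1 or not 1 <= k <= n:
        raise ValueError("expected n >= 1 and 1 <= k <= n")
    return 20 / n * (n + 1 - k)


if __name__ == "__main__":
    n = 4  # hypothetical leaderboard with four teams
    for k in range(1, n + 1):
        print(k, leaderboard_credit(n, k))
```

Under this reading, the score falls linearly with rank: the top rank ($k = 1$) receives 20 and the last rank ($k = n$) receives $20/n$, so every ranked team gets a nonzero score.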

Abstract

This paper presents a shared task that we organized at the Foundations of Language Technology (FoLT) course in 2023/2024 at the Technical University of Darmstadt, which focuses on evaluating the output of Large Language Models (LLMs) in generating harmful answers to health-related clinical questions. We describe the task design considerations and report the feedback we received from the students. We expect the task and the findings reported in this paper to be relevant for instructors teaching natural language processing (NLP) and designing course assignments.

Paper Structure (10 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Annotating LLM answers with fine-grained categories