D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models

Duygu Altinok

TL;DR

The paper evaluates the natural language inference capabilities of both open-source and closed-source LLMs in the medical domain, using clinical trial reports and a SemEval-2024 Task 2 contrast set. It systematically compares multiple LLMs, revealing that Gemini Pro achieves the strongest performance on development data and competitive results on the test set, with open-source Falcon 40B close behind in some scenarios. The study highlights distinct failure modes, particularly in numerical-quantitative reasoning and scenario-specific inference, while showing that modern LLMs can perform nontrivial medical NLI with meaningful accuracy. These findings suggest practical potential for medical inference tasks, albeit with caution and the need for further data, compute, and robust evaluation in high-stakes settings.

Abstract

Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

Paper Structure (11 sections, 13 figures, 4 tables)

Figures (13)

  • Figure 1: An example comparison task from the training set with two CTRs.
  • Figure 2: An example comparison task from the training set with two CTRs.
  • Figure 3: Initiation of the conversation with PaLM.
  • Figure 4: A challenging instance that was incorrectly predicted by the top-performing LLMs.
  • Figure 5: Responses of the top-performing LLMs to the selected challenging instance, where all models failed to exhibit any signs of numerical inference.
  • ...and 8 more figures