SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

Mael Jullien; Marco Valentino; André Freitas

TL;DR

SemEval-2024 Task 2 introduces NLI4CT-P, a perturbed benchmark for safe biomedical natural language inference on clinical trial reports, enabling causal and robustness analysis of NLI models. It defines Faithfulness and Consistency metrics and analyzes 25 submissions across 12 architectures, showing generative models generally outperform discriminative ones and that additional data and instruction tuning yield significant gains, especially in faithfulness. The study further reveals that prompting strategy and model size interact in complex ways, with zero-shot prompting frequently outperforming few-shot and mid-sized architectures offering cost-effective competitiveness. Overall, the work highlights that high F1 alone does not guarantee trustworthy clinical NLI, and it outlines avenues for future causal evaluation and intervention-level analysis to advance safe AI in clinical decision support.
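
As an illustration of what the two metrics measure, the following is a minimal Python sketch under stated assumptions, not the task's official scorer. It assumes binary labels (1 for entailment, 0 for contradiction), that Consistency is the fraction of meaning-preserving perturbations on which the prediction is unchanged, and that Faithfulness is the fraction of meaning-altering (label-flipping) perturbations on which the prediction changes, counted only over pairs whose original statement was predicted correctly. The function and parameter names are hypothetical.

```python
from typing import Sequence


def consistency(orig_preds: Sequence[int], pert_preds: Sequence[int]) -> float:
    """Fraction of meaning-preserving perturbations whose prediction
    equals the prediction on the original statement (assumed definition)."""
    if len(orig_preds) != len(pert_preds):
        raise ValueError("prediction lists must have the same length")
    if not orig_preds:
        return 0.0
    # 1 - |f(y) - f(x)| is 1 when the two binary predictions agree, else 0.
    agree = sum(1 - abs(o - p) for o, p in zip(orig_preds, pert_preds))
    return agree / len(orig_preds)


def faithfulness(
    orig_preds: Sequence[int],
    pert_preds: Sequence[int],
    orig_labels: Sequence[int],
) -> float:
    """Fraction of label-flipping perturbations on which the prediction
    changes, restricted to pairs where the original prediction was correct
    (assumed definition; the restriction is an assumption of this sketch)."""
    if not (len(orig_preds) == len(pert_preds) == len(orig_labels)):
        raise ValueError("all input lists must have the same length")
    # Keep only pairs whose original statement was classified correctly.
    pairs = [
        (o, p)
        for o, p, y in zip(orig_preds, pert_preds, orig_labels)
        if o == y
    ]
    if not pairs:
        return 0.0
    # |f(y) - f(x)| is 1 when the prediction flips, else 0.
    return sum(abs(o - p) for o, p in pairs) / len(pairs)
```

Under these assumptions, a model that always predicts the same label would score perfectly on Consistency and zero on Faithfulness, which is why the two metrics are read together with F1.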

Abstract

Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs. These shortcomings are especially critical in medical contexts, where they can misrepresent actual model capabilities. Addressing this, we present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our contributions include the refined NLI4CT-P dataset (i.e., Natural Language Inference for Clinical Trials - Perturbed), designed to challenge LLMs with interventional and causal reasoning tasks, along with a comprehensive evaluation of methods and results for participant submissions. A total of 106 participants registered for the task, contributing over 1200 individual submissions and 25 system overview papers. This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. We anticipate that the dataset, models, and outcomes of this task can support future research in the field of biomedical NLI. The dataset, competition leaderboard, and website are publicly available.

Paper Structure (31 sections, 2 equations, 5 figures, 4 tables)

Introduction
Challenges in Clinical NLI:
Importance of Faithfulness and Consistency Evaluation:
Superiority of Generative Models:
Value of Additional Data:
Impact of Prompting Strategies:
Efficacy of Mid-Sized Architectures:
Task Description
Dataset
Interventions
Paraphrasing and Contradiction Rephrasing
Numerical Paraphrasing and Contradiction
Appending Text
Evaluation
Faithfulness
...and 16 more sections

Figures (5)

Figure 1: The goal of NLI4CT is to predict the relationship of entailment between a Statement and a CTR premise (Jullien et al., 2023). In this task, we introduce a set of perturbations (NLI4CT-P) applied to the statements to test the semantic consistency and faithfulness of NLI models.
Figure 2: Comparative Analysis of F1, Consistency, and Faithfulness Across Model Types
Figure 3: Comparative Analysis of F1, Consistency, and Faithfulness Across Model Parameter Numbers
Figure 4: Comparative Analysis of F1, Consistency, and Faithfulness Across Prompting Strategies
Figure 5: Comparative Analysis of F1, Consistency, and Faithfulness Across Training Strategies
