SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials

Mathilde Aguiar; Pierre Zweigenbaum; Nona Naderi

TL;DR

The paper tackles Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT) in SemEval-2024 Task 2 by comparing finetuned ensembles of Masked Language Models with prompt-based Large Language Models that employ Chain-of-Thought strategies. The authors find that prompting Flan-T5-large in a 2-shot setting yields the best reported metrics (F1=0.57, Faithfulness=0.64, Consistency=0.56), while an ensemble of MLMs can achieve the same scores, highlighting a trade-off between efficiency and reasoning depth. The work demonstrates the viability of both discriminative and generative paradigms for clinical NLI and provides insights into how prompt design, evidence usage, and input length influence faithfulness and consistency. It also outlines directions for future improvements, including domain-specific pretraining and ontology integration to enhance robustness in clinical text understanding.

Abstract

This paper describes our submission to Task 2 of SemEval-2024: Safe Biomedical Natural Language Inference for Clinical Trials. The Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT) consists of a Textual Entailment (TE) task focused on evaluating the consistency and faithfulness of Natural Language Inference (NLI) models applied to Clinical Trial Reports (CTR). We test two distinct approaches: one finetunes and ensembles Masked Language Models, and the other prompts Large Language Models using templates, in particular Chain-Of-Thought and Contrastive Chain-Of-Thought. Prompting Flan-T5-large in a 2-shot setting leads to our best system, which achieves an F1 score of 0.57, Faithfulness of 0.64, and Consistency of 0.56.
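To make the 2-shot setting concrete, here is a minimal sketch of how such a prompt could be assembled: two labelled demonstrations followed by the unlabelled query. The template wording, the `Example` class, the function names, and the sample texts are all assumptions made for illustration; the paper's actual templates are the ones shown in its figures and may differ.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One NLI instance: a CTR section excerpt, a statement, and an optional label."""
    premise: str
    statement: str
    label: str = ""  # "Entailment" or "Contradiction"; left empty for the query


def format_example(ex: Example) -> str:
    # Hypothetical template wording, not the authors' exact prompt.
    text = (
        f"Premise: {ex.premise}\n"
        f"Statement: {ex.statement}\n"
        f"Answer: {ex.label}"
    )
    # For the query the label is empty, so drop the trailing space after "Answer:".
    return text.rstrip()


def build_few_shot_prompt(demos: list[Example], query: Example) -> str:
    """Concatenate an instruction, the labelled demonstrations, and the query."""
    instruction = (
        "Does the premise entail or contradict the statement? "
        "Reply with Entailment or Contradiction."
    )
    blocks = [instruction]
    blocks += [format_example(d) for d in demos]
    blocks.append(format_example(query))
    return "\n\n".join(blocks)


# Invented toy data, only to show the shape of a 2-shot prompt.
demos = [
    Example("Inclusion criteria: adults aged 18 or older.",
            "Children are eligible for the trial.", "Contradiction"),
    Example("Intervention: drug A, 10 mg daily for 12 weeks.",
            "Participants receive drug A every day.", "Entailment"),
]
query = Example("Exclusion criteria: prior chemotherapy.",
                "Patients with prior chemotherapy can enrol.")

prompt = build_few_shot_prompt(demos, query)
print(prompt)
```

The resulting string would then be passed to a generative model such as Flan-T5-large, and the generated text mapped back to one of the two labels; that step is omitted here because it depends on the inference setup.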

Paper Structure (31 sections, 2 equations, 5 figures, 12 tables)

Introduction
Background
Corpus and task description
Related work
System overview
Finetuning pretrained masked language models
Prompting generative large language models
Experimental setup
Data pre-processing
Ensembling MLMs
Prompting generative LLMs
Evaluation
Results
Quantitative analysis
Error analysis
...and 16 more sections

Figures (5)

Figure 1: MLM ensemble architecture overview.
Figure 2: Example of an inference mechanism using a statement and the Eligibility section of a CTR.
Figure 3: Example Zero-shot prompt.
Figure 4: Example Chain-Of-Thought demonstration.
Figure 5: Example Contrastive Chain-Of-Thought demonstration.
