SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials
Mathilde Aguiar, Pierre Zweigenbaum, Nona Naderi
TL;DR
The paper addresses SemEval-2024 Task 2, Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT), by comparing finetuned ensembles of Masked Language Models (MLMs) with prompt-based Large Language Models (LLMs) that use Chain-of-Thought strategies. The authors find that prompting Flan-T5-large in a 2-shot setting yields their best reported metrics (F1=0.57, Faithfulness=0.64, Consistency=0.56), while an ensemble of MLMs can achieve the same scores, highlighting a trade-off between efficiency and reasoning depth. The work demonstrates the viability of both discriminative and generative paradigms for clinical NLI and provides insights into how prompt design, evidence usage, and input length influence faithfulness and consistency. It also outlines directions for future improvements, including domain-specific pretraining and ontology integration to enhance robustness in clinical text understanding.
Abstract
This paper describes our submission to Task 2 of SemEval-2024: Safe Biomedical Natural Language Inference for Clinical Trials. The Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT) task is a Textual Entailment (TE) task that evaluates the consistency and faithfulness of Natural Language Inference (NLI) models applied to Clinical Trial Reports (CTRs). We test two distinct approaches: one based on finetuning and ensembling Masked Language Models, and the other based on prompting Large Language Models with templates, in particular Chain-of-Thought and Contrastive Chain-of-Thought. Prompting Flan-T5-large in a 2-shot setting gives our best system, which achieves an F1 score of 0.57, a Faithfulness of 0.64, and a Consistency of 0.56.
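To make the 2-shot setting concrete, the following is a minimal sketch of how a few-shot prompt for this kind of entailment task could be assembled. It is not the authors' template: the instruction wording, the `Example` class, its field names, and the label strings "Entailment" and "Contradiction" are all assumptions made for illustration, and the placeholder trial text is invented.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical instruction text; the template used in the paper is not shown here.
INSTRUCTION = (
    "Decide whether the statement is supported by the clinical trial premise. "
    "Answer with Entailment or Contradiction."
)


@dataclass(frozen=True)
class Example:
    """One premise/statement pair (class and field names are illustrative)."""
    premise: str      # text taken from a clinical trial report section
    statement: str    # claim to check against the premise
    label: str = ""   # gold label for a demonstration; empty for the query


def format_example(ex: Example) -> str:
    """Render one example; a labelled demonstration ends with its answer."""
    block = f"Premise: {ex.premise}\nStatement: {ex.statement}\nAnswer:"
    return f"{block} {ex.label}" if ex.label else block


def build_few_shot_prompt(demos: Sequence[Example], query: Example,
                          instruction: str = INSTRUCTION) -> str:
    """Concatenate the instruction, k labelled demonstrations, and the query."""
    if any(not d.label for d in demos):
        raise ValueError("every demonstration needs a label")
    # Drop any label on the query so the prompt ends with an open 'Answer:'.
    open_query = Example(query.premise, query.statement)
    parts = [instruction]
    parts += [format_example(d) for d in demos]
    parts.append(format_example(open_query))
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Two demonstrations give a 2-shot prompt; the content is placeholder text.
    demos = [
        Example("Arm A received 10 mg daily.", "Arm A received a daily dose.",
                "Entailment"),
        Example("Patients under 18 were excluded.", "Children were enrolled.",
                "Contradiction"),
    ]
    query = Example("The trial enrolled 40 patients.",
                    "Fewer than 50 patients were enrolled.")
    print(build_few_shot_prompt(demos, query))
```

The resulting string would then be tokenized and passed to the model's generation call; that step is omitted because the abstract does not describe it. A Chain-of-Thought variant would typically place a short rationale before each demonstration's answer, which this sketch does not attempt.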
