Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4

Aryo Pradipta Gema, Giwon Hong, Pasquale Minervini, Luke Daines, Beatrice Alex

TL;DR

This work tackles NLI4CT, where a system must decide whether a hypothesis is entailed or contradicted by evidence from Clinical Trial Reports, with a focus on faithfulness and consistency. It compares prompting strategies (In-Context Learning, Chain-of-Thought) and Parameter-Efficient Fine-Tuning (PEFT) across several LLMs, and proposes a PEFT method that merges two adapters trained separately with language modelling and triplet objectives. Merging the two adapters improves F1 (+0.0346) and consistency (+0.152), but the fine-tuned models still do not match GPT-4 on faithfulness and consistency; averaged over the three metrics, GPT-4 scores 0.8328 and ranks joint-first in the competition. A contamination analysis with GPT-4 indicates no test data leakage.

Abstract

The NLI4CT task assesses Natural Language Inference systems in predicting whether hypotheses entail or contradict evidence from Clinical Trial Reports. In this study, we evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT). We propose a PEFT method to improve the consistency of LLMs by merging adapters that were fine-tuned separately using triplet and language modelling objectives. We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. However, our novel methods did not produce more accurate results than GPT-4 in terms of faithfulness and consistency. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328. Finally, our contamination analysis with GPT-4 indicates that there was no test data leakage.
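
The adapter-merging idea in the abstract can be illustrated with a small sketch. The abstract does not state the merging rule, so the code below assumes a simple weighted linear combination of two LoRA-style low-rank updates; the function names, the toy matrices, and the 0.5/0.5 weights are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_delta(B: Matrix, A: Matrix, scale: float = 1.0) -> Matrix:
    """Low-rank weight update scale * (B @ A), as in a LoRA adapter."""
    return [[scale * v for v in row] for row in matmul(B, A)]


def merge_deltas(deltas: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of adapter updates (assumed merging rule, not the paper's)."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [
        [sum(w * d[i][j] for d, w in zip(deltas, weights)) for j in range(cols)]
        for i in range(rows)
    ]


def apply_delta(W: Matrix, delta: Matrix) -> Matrix:
    """Add the merged update to the frozen base weight."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy 2x2 base weight and two rank-1 adapters (hypothetical values).
W_base = [[1.0, 0.0], [0.0, 1.0]]
delta_lm = lora_delta(B=[[1.0], [0.0]], A=[[0.0, 2.0]])       # LM-objective adapter
delta_triplet = lora_delta(B=[[0.0], [1.0]], A=[[4.0, 0.0]])  # triplet-objective adapter

merged_delta = merge_deltas([delta_lm, delta_triplet], weights=[0.5, 0.5])
W_merged = apply_delta(W_base, merged_delta)
```

Because the base weights stay frozen and only the low-rank updates are combined, the two adapters can be trained independently and merged afterwards without retraining either one.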

Paper Structure

This paper contains 29 sections, 3 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Our inference schema with multiple prompting strategies (without fine-tuning). For the Chain-of-Thought examples, the Natural Language Explanations were generated using ChatGPT (He et al., 2023).
  • Figure 2: Our proposed fine-tuning scheme for SemEval-2024 Task 2. We propose merging adapters trained with Language Modelling (LM) fine-tuning, which uses a language modelling loss for predicting either "Entailment" or "Contradiction", with adapters trained with triplet fine-tuning, which uses a triplet loss.
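
The two objectives named in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the Euclidean distance, the margin of 1.0, the choice of anchor, positive, and negative embeddings, and the reduction of the language modelling loss to a two-label negative log-likelihood are all assumptions made here for clarity.

```python
import math
from typing import Dict, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def label_nll(logits: Dict[str, float], gold: str) -> float:
    """Negative log-likelihood of the gold label under a softmax over label logits."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return log_z - logits[gold]


# The positive is farther from the anchor than the negative, so the loss is positive.
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])

# The positive is already closer than the negative by more than the margin: zero loss.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Uninformative logits give a loss of log(2) for a two-label decision.
uniform_nll = label_nll({"Entailment": 0.0, "Contradiction": 0.0}, gold="Entailment")
```

The triplet term only shapes relative distances between representations, while the label term directly rewards the correct "Entailment" or "Contradiction" output, which is one way to read why the caption treats them as separate adapters to be merged.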