FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain

Jin Liu; Steffen Thoma

TL;DR

This work addresses Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT) by combining Chain-of-Thought (CoT) reasoning with self-consistency to produce faithful and transparent inferences in the biomedical domain. The authors distill GPT-4-generated CoT rationales into a LoRA-tuned open model (Mixtral-8x7B-Instruct) and, at inference time, sample multiple CoT chains and aggregate them by majority voting, aiming to improve faithfulness and consistency over label-only and greedy CoT approaches. Empirical results show a competitive baseline F1 ($0.800$) and strong faithfulness ($\approx 0.903$) with reasonable consistency ($\approx 0.729$), though the overall gains over simpler baselines are modest, which the authors attribute to the limited number of reasoning chains and to aggregation challenges. The study demonstrates that domain-specific knowledge can be effectively infused into a compact model via LoRA and highlights the need for more granular evaluation of intermediate reasoning steps in CoT for robust clinical reasoning tasks.

Abstract

This paper describes the inference system of FZI-WIM at SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system uses the chain-of-thought (CoT) paradigm to tackle this complex reasoning problem and further improves CoT performance with self-consistency. Instead of greedy decoding, we sample multiple reasoning chains with the same prompt and determine the final verdict by majority voting. The self-consistent CoT system achieves a baseline F1 score of 0.80 (1st), a faithfulness score of 0.90 (3rd), and a consistency score of 0.73 (12th). We release the code and data publicly at https://github.com/jens5588/FZI-WIM-NLI4CT.
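The voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `extract_label` and `majority_vote`, the regex-based label extraction, and the tie-breaking fallback to a default label are all assumptions made for illustration. The sketch assumes each sampled chain is a string that ends by naming one of the two NLI4CT labels.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def extract_label(chain: str) -> Optional[str]:
    """Return the last label named in a reasoning chain, or None if absent.

    Assumption: the final verdict is the last occurrence of either label.
    """
    matches = re.findall(r"\b(entailment|contradiction)\b", chain, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else None


def majority_vote(chains: Iterable[str], default: str = "Contradiction") -> str:
    """Aggregate sampled reasoning chains into one label by majority voting.

    Chains without a recognizable label are ignored. On a tie, or when no
    chain yields a label, the hypothetical `default` label is returned.
    """
    votes = Counter(
        label for label in (extract_label(c) for c in chains) if label is not None
    )
    if not votes:
        return default
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default  # tie between the two labels
    return ranked[0][0]


sampled_chains = [
    "The report lists 3 of 50 patients with nausea, so the statement holds. Entailment",
    "The cohort sizes differ from the statement. Contradiction",
    "Both arms received the same dose as claimed. Final answer: entailment",
]
final_label = majority_vote(sampled_chains)
```

With an odd number of sampled chains that all produce a label, a tie cannot occur for a two-label task, so the fallback only matters when some chains fail to state a verdict.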

Paper Structure (25 sections, 3 equations, 7 figures, 5 tables)

Introduction
Background
System overview
Knowledge Distillation
LoRA Instruction-tuning
Self-Consistency
Experimental setup
LoRA Instruction-tuning
Inference
Evaluation Metrics
Evaluation
Baseline F1
Consistency
Faithfulness
Self-consistent CoT and CoT Greedy
...and 10 more sections

Figures (7)

Figure 1: A data example. With the same clinical report, semantic-preserving and semantic-altering interventions on the original statement are used to evaluate the consistency and faithfulness of the verification system.
Figure 2: Training and Inference pipeline of self-consistent CoT system.
Figure 3: Example prompt for GPT-4
Figure 4: An example of CoT instruction-tuning dataset
Figure 5: An example of label-only instruction-tuning dataset
...and 2 more figures
