FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain
Jin Liu, Steffen Thoma
TL;DR
This work addresses Safe Biomedical NLI for Clinical Trials (NLI4CT) by combining Chain-of-Thought (CoT) reasoning with self-consistency to produce faithful and transparent inferences in the biomedical domain. The authors distill GPT-4-generated CoT rationales into a LoRA-tuned open model (Mixtral-8x7B-Instruct) and, at inference time, sample multiple CoT chains and aggregate them by majority voting, aiming to improve faithfulness and consistency over label-only or greedy CoT approaches. Empirical results show a competitive baseline F1 ($0.800$), strong faithfulness ($\approx0.903$), and reasonable consistency ($\approx0.729$), though gains over simpler baselines are modest, which is attributed to the limited number of reasoning chains and to aggregation challenges. The study demonstrates that domain-specific knowledge can be effectively infused into an open model via LoRA and highlights the need for more granular evaluation of intermediate reasoning steps in CoT for robust clinical reasoning tasks.
Abstract
This paper describes the inference system of FZI-WIM at SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system uses the chain-of-thought (CoT) paradigm to tackle this complex reasoning problem and further improves CoT performance with self-consistency. Instead of greedy decoding, we sample multiple reasoning chains from the same prompt and determine the final label by majority voting. The self-consistent CoT system achieves a baseline F1 score of 0.80 (1st), a faithfulness score of 0.90 (3rd), and a consistency score of 0.73 (12th). We release the code and data publicly at https://github.com/jens5588/FZI-WIM-NLI4CT.
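The aggregation step described in the abstract (sample several reasoning chains, then take a majority vote over their final labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `extract_label` and `majority_vote`, the regex-based label extraction, and the tie-breaking behavior are assumptions made for illustration, and the chains are assumed to have already been sampled from the model with a non-greedy decoding strategy.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_label(chain: str) -> Optional[str]:
    """Return the last NLI label mentioned in a reasoning chain, or None.

    Assumption: the final verdict is the last occurrence of the word
    "Entailment" or "Contradiction" in the generated text.
    """
    matches = re.findall(r"\b(entailment|contradiction)\b", chain, flags=re.IGNORECASE)
    return matches[-1].capitalize() if matches else None


def majority_vote(chains: List[str]) -> Optional[str]:
    """Aggregate sampled chains into one label by majority voting.

    Chains without a recognizable label are ignored. On a tie, the label
    that was seen first wins (an arbitrary choice for this sketch).
    """
    votes = [label for label in map(extract_label, chains) if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


# Hypothetical sampled chains for one statement (illustrative text only).
sampled_chains = [
    "The trial reports the stated outcome, so the answer is Entailment.",
    "The cohort sizes differ from the claim. Final answer: Contradiction",
    "Both sections agree with the statement. Therefore: entailment",
]
final_label = majority_vote(sampled_chains)
```

With only a few chains per statement, a single divergent chain can flip the vote, which is consistent with the summary's remark that a limited number of reasoning chains constrains the gains from self-consistency.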
