
DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training

Bhuvanesh Verma, Lisa Raithel

TL;DR

This work tackles robust natural language inference for Clinical Trial Reports (NLI4CT) by combining instruction-tuned LLMs with a MinMax auxiliary model and data perturbations focused on acronyms and numerical values. The Mistral 7B–based system is augmented with LoRA PEFT and MedNLI pre-finetuning to enhance biomedical alignment, and a dedicated auxiliary learner directs the model toward hard examples. Perturbations reveal that acronym adjustments improve semantic-definitional cases while numerical perturbations influence semantic-preserving interventions; the combined approach yields mixed effects across interventions and sections, with strong performance for Adverse Events and numerical contradictions ($F_1$ up to 0.93). Overall, the MinMax-based robustness framework provides meaningful gains in Faithfulness and Consistency, offering practical pathways to safer, more reliable clinical NLP systems. The analysis of easy vs hard samples and section-level difficulties informs future work on numerical reasoning and data-cartography–driven curriculum design in biomedical NLI.

Abstract

The NLI4CT task at SemEval-2024 emphasizes the development of robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs). This edition introduces interventions specifically targeting the numerical, vocabulary, and semantic aspects of CTRs. Our proposed system harnesses the capabilities of the state-of-the-art Mistral model, complemented by an auxiliary model, to focus on the intricate input space of the NLI4CT dataset. Through the incorporation of numerical and acronym-based perturbations to the data, we train a robust system capable of handling both semantic-altering and numerical contradiction interventions. Our analysis on the dataset sheds light on the challenging sections of the CTRs for reasoning.
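
To make the idea of numerical and acronym-based perturbations concrete, here is a minimal sketch. It is not the authors' implementation: the acronym table, the function names `expand_acronyms` and `perturb_numbers`, and the fixed offset `delta` are all assumptions made for illustration, and the abstract does not specify how the perturbed instances were generated or labeled.

```python
import re

# Hypothetical acronym table (an assumption for this sketch); the resource
# actually used for the paper is not specified in this section.
ACRONYMS = {"AE": "adverse event", "ORR": "objective response rate"}


def expand_acronyms(text, table=ACRONYMS):
    """Replace each known acronym (whole-word match) with its long form."""
    if not table:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def perturb_numbers(text, delta=1):
    """Add `delta` to every integer or decimal literal found in the text."""
    def bump(match):
        token = match.group(0)
        if "." in token:
            # Keep the original number of decimal places.
            decimals = len(token.split(".")[1])
            return f"{float(token) + delta:.{decimals}f}"
        return str(int(token) + delta)

    return re.sub(r"\d+(?:\.\d+)?", bump, text)


statement = "ORR was 40% in cohort 1"
print(expand_acronyms(statement))
print(perturb_numbers(statement))
```

Expanding an acronym is intended to leave the meaning of a statement unchanged, whereas shifting a number can turn an entailed statement into a contradiction, so any real pipeline would need a rule for when the label is kept or flipped.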

Paper Structure

This paper contains 34 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: A sample instance from the NLI4CT dataset. Each instance consists of four sections: Intervention, Eligibility criteria, Results, and Adverse Events. The data are split into two types: single (depicted) and comparison. In single, one section of the CTR serves as the premise (in this case, Adverse Events). A human-annotated hypothesis for this premise is given (Statement), which is then to be classified into either entailment or contradiction.
  • Figure 2: Weight distribution of NLI4CT data instances generated by the auxiliary model after 3 epochs of training. Lower weights correspond to easy examples, and higher weights correspond to hard examples.
  • Figure 3: Data map for the NLI4CT dataset, following Swayamdipta et al. (2020).
  • Figure 4: Word overlap between the hypothesis and the premise in the easy and the hard examples.
  • Figure 5: Final design for prompting.
  • ...and 1 more figure