
Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency

Vasiliki Kougia, Anastasiia Sedova, Andreas Stephan, Klim Zaporojets, Benjamin Roth

TL;DR

This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text and reveals that LLMs face challenges in providing responses consistent with the temporal properties of uniqueness and transitivity.

Abstract

This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting, performing worse than fine-tuned specialized models in terms of F1 score, which shows that this is a challenging task for LLMs. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent with the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy, and whether the latter can be improved by resolving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
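
To make the two consistency properties concrete, here is a minimal sketch of how such scores could be computed. It is not the paper's implementation: the function names, the input layout, and the exact definitions are assumptions. Uniqueness is taken to mean that the model affirms exactly one relation per event pair, and transitivity that A BEFORE B and B BEFORE C imply A BEFORE C.

```python
from itertools import permutations


def uniqueness_score(answers):
    """Fraction of event pairs for which exactly one relation was affirmed.

    answers: {(event_a, event_b): set of relations the model said "yes" to}
    (hypothetical input layout, assumed for illustration).
    """
    if not answers:
        return None
    unique = sum(1 for rels in answers.values() if len(rels) == 1)
    return unique / len(answers)


def transitivity_score(relations, rel="BEFORE"):
    """Fraction of checkable triples (a, b, c) that respect transitivity.

    relations: {(event_a, event_b): single predicted relation}.
    A triple is checkable when a-b and b-c are both `rel`
    and a prediction for a-c exists.
    """
    events = {e for pair in relations for e in pair}
    checked = consistent = 0
    for a, b, c in permutations(events, 3):
        if (relations.get((a, b)) == rel
                and relations.get((b, c)) == rel
                and (a, c) in relations):
            checked += 1
            consistent += relations[(a, c)] == rel
    return consistent / checked if checked else None


# Toy predictions (invented events, not taken from the paper's data).
answers = {
    ("admission", "surgery"): {"BEFORE"},
    ("surgery", "discharge"): {"BEFORE", "OVERLAP"},  # violates uniqueness
}
relations = {
    ("admission", "surgery"): "BEFORE",
    ("surgery", "discharge"): "BEFORE",
    ("admission", "discharge"): "AFTER",  # violates transitivity
}
u_score = uniqueness_score(answers)
t_score = transitivity_score(relations)
```

Both functions return a value between 0 and 1, or `None` when there is nothing to check. A high score only says the predictions agree with each other, not that they match the gold relations, which is the distinction the abstract draws in its final sentence.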

Paper Structure (24 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: An example of three event pairs annotated with temporal relations. The right part shows the order of the events with respect to time (t) and the consistency of uniqueness and transitivity.
  • Figure 2: Examples of two transitive triples with inconsistent predictions. After the ILP, the predictions are consistent but still differ from the gold relations.
  • Figure 3: Barplot where each bar represents a range of distances between events in the gold pairs. The y axis shows the F1 score of the predictions for the pairs in each bar.
  • Figure 4: Examples of an interaction with the LLM using two different prompting strategies: BatchQA and Chain-of-Thought.