Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?

Gabriel Roccabruna, Massimo Rizzoli, Giuseppe Riccardi

TL;DR

This work investigates LLMs’ performance and decision process in the Temporal Relation Classification task, and shows that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa.

Abstract

The automatic detection of temporal relations among events has mainly been investigated with encoder-only models such as RoBERTa. Large Language Models (LLMs) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested temporal relation detection with closed-source LLMs only, limiting the interpretability of those results. In this work, we investigate LLMs' performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open- and closed-source LLMs, experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainability methods. The outcome suggests a limitation of LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models can be found on GitHub.

Paper Structure

This paper contains 16 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: An example taken from the MATRES corpus for the Temporal Relation Classification task, in which the accusation event follows the driving event. The relation between the two event triggers, namely $e_1$:accused and $e_2$:driving, is annotated with a directed arc and the label AFTER.
  • Figure 2: Distribution of the five tokens with the highest attribute score in each input sequence, computed with Llama2 7B (Touvron et al., 2023). Corpora on the y-axis and relative position in the input sequence on the x-axis. The blue line is the median.
  • Figure 3: Distribution of the five tokens with the highest attribute score in each input sequence, computed with RoBERTa (Liu et al., 2019). Corpora on the y-axis and relative position in the input sequence on the x-axis. The blue line is the median.
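
The quantity plotted in Figures 2 and 3 can be illustrated with a minimal sketch. It assumes that per-token attribute scores are already available as a list of floats, and that relative position is normalized so that 0.0 is the first token and 1.0 is the last. The function names and the normalization are illustrative assumptions, not the authors' implementation.

```python
from statistics import median


def top_k_relative_positions(scores, k=5):
    """Relative positions (0.0 = first token, 1.0 = last) of the k tokens
    with the highest attribute score in one input sequence.

    Hypothetical helper: the normalization by (n - 1) is an assumption.
    """
    n = len(scores)
    if n == 0:
        return []
    # Rank token indices by score, highest first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Keep the k best and restore left-to-right order.
    top = sorted(ranked[:k])
    denom = max(n - 1, 1)  # avoid division by zero for one-token inputs
    return [i / denom for i in top]


def median_position(all_scores, k=5):
    """Median relative position pooled over many sequences (one corpus).

    Raises statistics.StatisticsError if no positions are collected.
    """
    pooled = [p for scores in all_scores
              for p in top_k_relative_positions(scores, k)]
    return median(pooled)


# Made-up scores for a 9-token sequence; the mass sits in the second half.
example = [0.1, 0.0, 0.2, 0.05, 0.9, 0.8, 0.7, 0.6, 0.3]
print(top_k_relative_positions(example))  # [0.5, 0.625, 0.75, 0.875, 1.0]
print(median_position([example]))         # 0.75
```

Under this convention, a median close to 1.0 would correspond to the abstract's observation that a model attends mostly to the last part of the sequence, while a median nearer 0.5 would indicate attribution spread more evenly across the input.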