
(Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection

Francesco Periti, Haim Dubossarsky, Nina Tahmasebi

TL;DR

This work addresses semantic change detection by evaluating off-the-shelf (Chat)GPT-3.5 against BERT on two diachronic WiC tasks: TempoWiC (short-term) and a newly introduced HistoWiC (long-term). The authors propose a controlled experimental framework with automatic prompts, varying in-context learning strategies, and a comparison across GPT API and ChatGPT Web, including a direct BERT baseline via layer-wise cosine-thresholding. Results show that GPT-3.5 generally underperforms BERT, particularly for short-term changes, though it shows relatively stronger performance on long-term historical change; API-based evaluation is more reliable than the web interface. The study highlights limitations of off-the-shelf ChatGPT for diachronic semantics and suggests that modern BERT-style embeddings remain robust baselines, while pointing to GPT-4 as a potential future improvement for lexical semantic change tasks.
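
The BERT baseline mentioned above, cosine thresholding evaluated with Macro-F1, can be illustrated with a minimal sketch. It assumes the contextual embeddings of the target word (one vector per sentence, taken from a single layer) have already been extracted; the function names, the toy vectors, and the idea of picking the threshold from a candidate grid on labelled data are illustrative assumptions, not the authors' exact procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict(pairs, threshold):
    """Label a pair 1 (same meaning) if similarity >= threshold, else 0."""
    return [1 if cosine(u, v) >= threshold else 0 for u, v in pairs]

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for label in (0, 1):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def fit_threshold(pairs, gold, candidates):
    """Pick the candidate threshold with the best Macro-F1 on labelled pairs."""
    return max(candidates, key=lambda t: macro_f1(gold, predict(pairs, t)))

# Toy 2-d "embeddings" standing in for one BERT layer (hypothetical data).
toy_pairs = [([1, 0], [1, 0]), ([1, 0], [0, 1]),
             ([1, 1], [1, 0.9]), ([1, 0], [-1, 0.2])]
toy_gold = [1, 0, 1, 0]
best = fit_threshold(toy_pairs, toy_gold, [i / 10 for i in range(1, 10)])
```

For a layer-wise comparison, the same fitting step would be repeated once per layer, using that layer's embeddings, and the resulting Macro-F1 scores compared across layers.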

Abstract

In the universe of Natural Language Processing, Transformer-based language models like BERT and (Chat)GPT have emerged as lexical superheroes with great power to solve open research problems. In this paper, we specifically focus on the temporal problem of semantic change, and evaluate their ability to solve two diachronic extensions of the Word-in-Context (WiC) task: TempoWiC and HistoWiC. In particular, we investigate the potential of a novel, off-the-shelf technology like ChatGPT (and GPT) 3.5 compared to BERT, which represents a family of models that currently stand as the state-of-the-art for modeling semantic change. Our experiments represent the first attempt to assess the use of (Chat)GPT for studying semantic change. Our results indicate that ChatGPT performs significantly worse than the foundational GPT version. Furthermore, our results demonstrate that (Chat)GPT achieves slightly lower performance than BERT in detecting long-term changes but performs significantly worse in detecting short-term changes.

Paper Structure (40 sections, 8 figures, 13 tables)

This paper contains 40 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: The title of this paper draws inspiration from the movie Batman v Superman: Dawn of Justice. We use the analogy of (Chat)GPT and BERT, powerful and popular LMs, as two lexical superheroes that are often erroneously assumed to solve similar problems. Our aim is to shed light on the potential of (Chat)GPT for semantic change detection.
  • Figure 2: Average number of wrongly formatted answers (WFAs) over the temperature values considered. Background lines correspond to each experiment.
  • Figure 3: Performance of GPT API (Macro-F1) as temperature increases.
  • Figure 4: Performance of ChatGPT Web (Macro-F1). Temperature is unknown.
  • Figure 5: Comparison of BERT performance (Macro-F1) on the TempoWiC and HistoWiC tasks across layers.
  • ...and 3 more figures