NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates

Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu

TL;DR

This work proposes an adaptive benchmark, NewTerm, for real-time evaluation of new terms, and designs a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information.

Abstract

Despite their remarkable abilities in various tasks, large language models (LLMs) still struggle with real-time information (e.g., new facts and terms) due to the knowledge cutoff in their development process. However, existing benchmarks focus on outdated content and limited fields, are difficult to update in real time, and leave new terms unexplored. To address this problem, we propose an adaptive benchmark, NewTerm, for real-time evaluation of new terms. We design a highly automated construction method to ensure high-quality benchmark construction with minimal human effort, allowing flexible updates for real-time information. Empirical results on various LLMs demonstrate a performance reduction of over 20% caused by new terms. Additionally, while updates to the knowledge cutoff of LLMs can cover some of the new terms, they are unable to generalize to more distant new terms. We also analyze which types of terms are more challenging and why LLMs struggle with new terms, paving the way for future research. Finally, we construct NewTerm 2022 and 2023 to evaluate the new terms updated each year and will continue updating annually. The benchmark and code can be found at https://github.com/hexuandeng/NewTerm.

Paper Structure

This paper contains 104 sections, 10 figures, 21 tables.

Figures (10)

  • Figure 1: The framework for constructing the benchmark based on real-time new terms from the dictionary.
  • Figure 2: The construction pipeline for NewTerm benchmark. We use different colors to indicate the parts used by each task, with COMA, COST, and CSJ represented by green, yellow, and red.
  • Figure 3: Examples of three open-domain NLU tasks in NewTerm benchmark. The choice with a checkmark is correct, while the choice with the ChatGPT icon is the one ChatGPT incorrectly selects under zero-shot settings. The underlined word is the new term.
  • Figure 4: (Left) Performance of LLMs on different terms under Base setting in NewTerm 2022. The dashed line in the middle figure represents the average accuracy of GPT-4. (Right) The overlap of learned terms selected by each series of models in NewTerm 2022 and 2023.
  • Figure 5: The performance of ChatGPT for different types of new terms. Orange columns represent frequency, green represents deducing difficulty, and purple represents Word/Phrase. The lower dashed line represents the average score for Base setting, while the higher one represents Gold setting.
  • ...and 5 more figures