Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark

Zhikun Xu, Yinghui Li, Ruixue Ding, Xinyu Wang, Boli Chen, Yong Jiang, Hai-Tao Zheng, Wenlian Lu, Pengjun Xie, Fei Huang

TL;DR

This paper introduces CDQA, a Chinese Dynamic QA benchmark of question-answer pairs drawn from the latest news on the Chinese Internet. The authors expect it to become a key data resource for improving LLMs' Chinese question-answering ability.

Abstract

How to better evaluate the capabilities of Large Language Models (LLMs) is a focal point of current LLM research. Previous work has noted that, because iteratively updating LLMs is extremely costly, they often cannot answer the latest dynamic questions well. To improve Chinese LLMs' ability to answer dynamic questions, we introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet. We obtain high-quality data through a pipeline that combines humans and models, and we classify the samples by how frequently their answers change, which allows a more fine-grained view of LLMs' capabilities. We also evaluate and analyze mainstream and advanced Chinese LLMs on CDQA. Extensive experiments suggest that CDQA is challenging and worthy of further study. We believe that the benchmark will become one of the key data resources for improving LLMs' Chinese question-answering ability.

Paper Structure (29 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Data generation pipeline for the CDQA dataset. We first collect Chinese news from the Internet and then extract entities from the news passages. Using GPT-4, we generate synthetic queries from the passages and their corresponding entities. Manual annotation verifies the synthetic data and the extra human-crafted queries, yielding verified queries, answers, and supporting evidence links.
  • Figure 2: Our prompts are formulated under this framework. Different prompting methods use different instructions $\mathbf{i}$. The Chinese version is in Appendix \ref{sec:translated_prompt}.
  • Figure 3: F1-recall scores and answer rates of different prompts for LLMs in the closed-book scenario under the zero-shot setting. F1-recall scores are shown as bar plots and answer rates as dotted lines.
  • Figure 4: F1-recall scores and answer rates of different prompts for LLMs in the open-book scenario under the zero-shot setting. F1-recall scores are shown as bar plots and answer rates as dotted lines.
  • Figure 5: F1-recall scores averaged over all three question types for all models with different prompts in the open-book scenario under the zero-shot setting. Only F1-recall scores are shown, since all answer rates are $\geq$ 90%.
  • ...and 9 more figures