Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

Shangqing Zhao, Yuhao Zhou, Yupei Ren, Zhe Chen, Chenghao Jia, Fang Zhe, Zhaogaung Long, Shu Liu, Man Lan

TL;DR

Fùxì is a benchmark that evaluates both understanding and generation of ancient Chinese text across 21 tasks, addressing a gap left by prior benchmarks, which focused mainly on comprehension via multiple-choice questions. Its evaluation framework combines rule-based metrics with a fine-tuned LLM evaluator to assess generation quality and cultural authenticity; validation shows strong agreement with human judgments. Experiments on a wide range of open- and closed-source models reveal a persistent gap between comprehension and generation, and show that model size and training data have a larger effect on knowledge-intensive tasks. The benchmark and toolkit are publicly released to support research in ancient Chinese NLP and to guide the development of models better suited to classical Chinese literature and culture.

Abstract

Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.

Paper Structure

This paper contains 23 sections, 26 figures, and 10 tables.

Figures (26)

  • Figure 1: Interactions with GPT-4 about Ancient Chinese related topics.
  • Figure 2: Model performance scaling with model size on four task categories.
  • Figure 3: Comparison of Accuracy and Cohen's Kappa across different labels. Overall performance: Accuracy = 0.898, Cohen's Kappa = 0.764.
  • Figure 4: Prompt for the LLM Evaluator.
  • Figure 5: Example for task: Ancient Chinese RC (Part 1).
  • ...and 21 more figures