CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models

Yizhi LI, Ge Zhang, Xingwei Qu, Jiali Li, Zhaoqun Li, Zekun Wang, Hao Li, Ruibin Yuan, Yinghao Ma, Kai Zhang, Wangchunshu Zhou, Yiming Liang, Lei Zhang, Lei Ma, Jiajun Zhang, Zuowen Li, Stephen W. Huang, Chenghua Lin, Jie Fu

TL;DR

CIF-Bench is a diversified, language-focused instruction-following benchmark for evaluating the zero-shot generalizability of LLMs to Chinese. It assembles 150 tasks across 20 categories, with public and private data splits and five instruction variations per task, and couples them with a model-based automatic evaluation pipeline (GPT-4 for classification and generation, BLEURT for semantic similarity). The study evaluates 28 LLMs and finds a substantial generalization gap in Chinese instruction-following (the best model scores only 52.9%), while also highlighting data leakage and transferability issues. By exposing these limitations and providing a robust evaluation framework, CIF-Bench aims to spur the development of more adaptable, culturally aware, and linguistically diverse language models for global NLP tasks.

Abstract

The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (CIF-Bench), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.
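
As a quick sanity check on the dataset figures quoted in the abstract, the short Python sketch below derives the averages they imply. The function name `dataset_summary` and its dictionary keys are hypothetical, and applying the public/private split at the level of input-output pairs is an assumption; the abstract only says that half of the dataset is released.

```python
# Figures quoted in the abstract.
NUM_TASKS = 150
NUM_PAIRS = 15_000        # input-output pairs
NUM_INSTANCES = 45_000    # data instances after instruction diversification
PUBLIC_FRACTION = 0.5     # "only half of the dataset" is released publicly


def dataset_summary(num_tasks, num_pairs, num_instances, public_fraction):
    """Derive average sizes implied by the headline counts (hypothetical helper)."""
    # Assumption: the public/private split is applied to input-output pairs.
    public_pairs = int(num_pairs * public_fraction)
    return {
        "pairs_per_task": num_pairs / num_tasks,
        "instances_per_pair": num_instances / num_pairs,
        "public_pairs": public_pairs,
        "private_pairs": num_pairs - public_pairs,
    }


summary = dataset_summary(NUM_TASKS, NUM_PAIRS, NUM_INSTANCES, PUBLIC_FRACTION)
for key, value in summary.items():
    print(f"{key}: {value}")
```

On these numbers, each task has 100 input-output pairs on average, and each pair corresponds to three data instances on average. The abstract does not say how the instances are distributed across tasks, so these are averages only.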

Paper Structure

This paper contains 45 sections, 1 equation, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: A large language model can tackle an English task translated into Chinese, but fail to respond to an instruction originally written in Chinese.
  • Figure 2: Task Category Distribution in CIF-Bench. The radii have three groups, determined by the number of tasks contained ($\leq10$, $\leq20$, and $>20$).
  • Figure 3: An Exemplar Prompt for GPT-4 Evaluator for the Task "Chinese Rhetoric Detection".