NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Miao Li; Ming-Bin Chen; Bo Tang; Shengbin Hou; Pengyu Wang; Haiying Deng; Zhiyu Li; Feiyu Xiong; Keming Mao; Peng Cheng; Yi Luo

TL;DR

NewsBench is a domain-specific evaluation framework for assessing the editorial capabilities of large language models (LLMs) in Chinese journalism, covering both writing proficiency and safety adherence. It introduces a benchmark of 1,267 test samples spanning five editorial tasks and 24 news domains, together with two GPT-4-based evaluation protocols that are validated against human evaluations, and uses them to compare ten LLMs. The study finds GPT-4-1106 and ERNIE Bot to be the top performers across tasks, but it also reveals persistent safety weaknesses in creative writing tasks and only a limited correlation between model scale and performance. The framework offers a reusable standard for aligning LLM outputs with journalistic ethics and safety, and it can guide future improvements to Chinese-language editorial AI systems.

Abstract

We present NewsBench, a novel evaluation framework for systematically assessing the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our benchmark dataset covers four facets of writing proficiency and six facets of safety adherence. It comprises 1,267 carefully and manually designed test samples, in the form of multiple choice questions and short answer questions, for five editorial tasks in 24 news domains. To measure performance, we propose separate GPT-4-based automatic evaluation protocols that assess LLM generations for short answer questions in terms of writing proficiency and safety adherence; both are validated by high correlations with human evaluations. Using this framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as the top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.
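The validation step mentioned in the abstract, checking that automatic scores track human scores, can be illustrated with a minimal sketch. The abstract does not say which correlation coefficient is used, so the Pearson and Spearman functions below are assumptions chosen for illustration, and the two score lists are hypothetical placeholder values, not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for six short answer responses (illustration only).
auto_scores = [3, 2, 3, 1, 2, 3]   # scores from an automatic judge
human_scores = [3, 2, 2, 1, 2, 3]  # scores from a human annotator

print("Pearson:", round(pearson(auto_scores, human_scores), 3))
print("Spearman:", round(spearman(auto_scores, human_scores), 3))
```

A coefficient close to 1 would indicate that the automatic protocol ranks responses much as human annotators do; the rank-based variant is the more natural choice when scores are ordinal ratings with many ties.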

Paper Structure (20 sections, 1 figure, 24 tables)

Introduction
Related Work
The Evaluation Framework
Evaluation Facets for Writing and Safety
Question Types of Test Samples
Benchmark Dataset Construction
Prompt Formats for Test Samples
Dataset Construction by Human Experts
Dataset Statistics and Features
Evaluation Protocols for Short Answer Questions
Protocols for Writing Proficiency
Protocols for Safety Adherence
Human Validation of GPT-4 Scores
Systematic Evaluations of LLMs
Experimental Settings
...and 5 more sections

Figures (1)

Figure 1: The key components and processes for evaluating the editorial capabilities of an LLM with our evaluation framework, NewsBench. The numbers in brackets indicate the number of test samples that we construct for each group of evaluations. The boxes with bold borders show the overall scores for Short Answer Questions (SAQs) and Multiple Choice Questions (MCQs) on Safety Adherence (SA) and Journalistic Writing Proficiency (JWP), respectively.
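The four overall scores named in the caption (SAQ and MCQ, each on SA and JWP) can be pictured as a grouping of per-sample results. The sketch below is only an assumption about how such scores might be aggregated: the caption does not state the aggregation rule, so a plain mean per group is used here, and the function name `overall_scores` and the sample records are hypothetical.

```python
from collections import defaultdict


def overall_scores(results):
    """Average per-sample scores within each (question type, facet group).

    `results` is an iterable of (question_type, facet_group, score) tuples,
    for example ("MCQ", "SA", 1.0) for a correctly answered safety MCQ.
    """
    buckets = defaultdict(list)
    for question_type, facet_group, score in results:
        buckets[(question_type, facet_group)].append(score)
    # Assumption: each overall score is the unweighted mean of its group.
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical per-sample results, normalised to the range [0, 1].
sample_results = [
    ("MCQ", "SA", 1.0),
    ("MCQ", "SA", 0.0),
    ("MCQ", "JWP", 1.0),
    ("SAQ", "SA", 0.5),
    ("SAQ", "JWP", 0.75),
    ("SAQ", "JWP", 0.25),
]

for (question_type, facet_group), score in sorted(overall_scores(sample_results).items()):
    print(f"{question_type}-{facet_group}: {score:.2f}")
```

Keeping the four groups separate, rather than collapsing them into one number, mirrors the figure's presentation of distinct scores for writing proficiency and safety adherence.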
