CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain

Xin Tong; Bo Jin; Zhi Lin; Binjun Wang; Ting Yu; Qiang Cheng

TL;DR

This study constructs CPSDBench, a specialized evaluation benchmark tailored to the Chinese public security domain, and introduces a set of innovative evaluation metrics designed to quantify more precisely how well LLMs perform public security tasks.

Abstract

Large Language Models (LLMs) have demonstrated significant potential and effectiveness across multiple application domains. To assess the performance of mainstream LLMs on public security tasks, this study constructs CPSDBench, a specialized evaluation benchmark tailored to the Chinese public security domain. CPSDBench integrates public security datasets collected from real-world scenarios and supports a comprehensive assessment of LLMs across four key dimensions: text classification, information extraction, question answering, and text generation. The study also introduces a set of innovative evaluation metrics designed to quantify more precisely how well LLMs perform public security tasks. The in-depth analysis and evaluation conducted in this research improve our understanding of the strengths and limitations of existing models in addressing public security issues, and provide a reference for the future development of more accurate, customized LLMs for applications in this field.

Paper Structure (20 sections, 11 equations, 8 figures, 8 tables)

Introduction
Related Work
Large Language Model
Application and Evaluation Benchmarks of LLMs
Method
Tasks and Datasets
Text Classification
Information Extraction
Question Answering
Text Generation
Baselines and Prompt
Evaluation Metrics
Results and Analysis
Performance Evaluation
Error Analysis
...and 5 more sections

Figures (8)

Figure 1: The evaluation content of CPSDBench and its correspondence to the police categories in China. For the various police categories, CPSDBench includes at least five types of tasks, so that the assessment of each category is both comprehensive and targeted.
Figure 2: CPSDBench uses prompts tailored to each task. The prompts for the text classification and information extraction tasks are shown here; they primarily comprise four key elements: role, task, input, and constraints. (Note that the Chinese version of each prompt was used during evaluation.)
Figure 3: A comparative analysis of the overall performance of mainstream LLMs on CPSDBench. For information extraction, the F1 score is used as the final score. The text generation task uses the BERT score as its final score. CLF refers to text classification tasks, IE denotes information extraction tasks, QA represents question-answering tasks, and TG signifies text generation tasks.
Figure 4: The impact of text length on the F1 scores for LLMs. To avoid the occurrence of negative values in the violin plot due to the kernel density estimation process, a phased processing approach was adopted for visualization. Due to the poor performance of the Atom-1B model in information extraction tasks, it is not shown in the figure.
Figure 5: Error Type Distribution of LLMs in Sentiment Classification Task.
...and 3 more figures
