CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain

Xin Tong, Bo Jin, Zhi Lin, Binjun Wang, Ting Yu, Qiang Cheng

TL;DR

This study constructs CPSDBench, a specialized evaluation benchmark tailored to the Chinese public security domain, and introduces a set of innovative evaluation metrics designed to more precisely quantify the efficacy of LLMs in executing public security tasks.

Abstract

Large Language Models (LLMs) have demonstrated significant potential and effectiveness across multiple application domains. To assess the performance of mainstream LLMs in public security tasks, this study constructs CPSDBench, a specialized evaluation benchmark tailored to the Chinese public security domain. CPSDBench integrates public security datasets collected from real-world scenarios, supporting a comprehensive assessment of LLMs across four key dimensions: text classification, information extraction, question answering, and text generation. Furthermore, this study introduces a set of innovative evaluation metrics designed to more precisely quantify the efficacy of LLMs in executing public security tasks. The in-depth analysis and evaluation conducted in this research clarify the performance strengths and limitations of existing models in addressing public security issues and provide a reference for the future development of more accurate, customized LLMs for applications in this field.

Paper Structure (20 sections, 11 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: The evaluative content of CPSDBench and its correspondence to the police category in China. For various police categories, the CPSDBench benchmark testing framework has designed at least five types of tasks, aiming to achieve a comprehensive and specific assessment of each category, ensuring the evaluation is both holistic and targeted.
  • Figure 2: CPSDBench has designed appropriate prompts tailored to different tasks. Here, prompts for text classification and information extraction tasks are enumerated, primarily encompassing four key elements: role, task, input, and constraints. (Note that we used the Chinese version of the prompt during the evaluation process.)
  • Figure 3: A comparative analysis of the overall performance of mainstream LLMs on CPSDBench. For information extraction, the F1 score is used as the final score. The text generation task employs the BERT score as its final score. CLF refers to text classification tasks, IE denotes relation extraction tasks, QA represents question-answering tasks, and TG signifies text generation tasks.
  • Figure 4: The impact of text length on the F1 scores for LLMs. To avoid the occurrence of negative values in the violin plot due to the kernel density estimation process, a phased processing approach was adopted for visualization. Due to the poor performance of the Atom-1B model in information extraction tasks, it is not shown in the figure.
  • Figure 5: Error Type Distribution of LLMs in Sentiment Classification Task.
  • ...and 3 more figures
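
The Figure 2 caption names four prompt elements: role, task, input, and constraints. The sketch below shows one way such a prompt could be assembled. It is a minimal illustration only: the `PromptSpec` class, the `build_prompt` function, the English field labels, and the sample text are all hypothetical, and the prompts actually used in the evaluation were written in Chinese and are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptSpec:
    """Hypothetical container for the four prompt elements named in Figure 2."""
    role: str               # who the model is asked to act as
    task: str               # what the model is asked to do
    input_text: str         # the text to be classified or processed
    constraints: List[str]  # output restrictions, one per entry


def build_prompt(spec: PromptSpec) -> str:
    """Join the four elements into one prompt string (illustrative layout only)."""
    constraint_lines = "\n".join(f"- {c}" for c in spec.constraints)
    return (
        f"Role: {spec.role}\n"
        f"Task: {spec.task}\n"
        f"Input: {spec.input_text}\n"
        f"Constraints:\n{constraint_lines}"
    )


if __name__ == "__main__":
    # Sample values are invented for illustration, not taken from the benchmark.
    example = PromptSpec(
        role="You are an analyst assisting with public security text review.",
        task="Classify the sentiment of the input text.",
        input_text="<text to classify>",
        constraints=["Answer with a single label.", "Do not add explanations."],
    )
    print(build_prompt(example))
```

Keeping the four elements as separate fields makes it straightforward to reuse the same role and constraints across inputs while swapping only the task and input text.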