Safety Evaluation of DeepSeek Models in Chinese Contexts
Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, Shiguo Lian
TL;DR
This paper addresses the gap in Chinese-context safety evaluation for DeepSeek models by introducing CHiSafetyBench, a benchmark aligned with national safety standards. It systematically assesses DeepSeek-R1 and DeepSeek-V3 against 10 auxiliary Chinese-capable LLMs across five safety domains using risk-content identification and refusal-to-answer tasks, reporting metrics such as ACC, RR-1, RR-2, and HR. The findings reveal notable weaknesses in discrimination and refusal capabilities, with DeepSeek models lagging behind top performers in several categories, and case studies illustrating safety gaps in practice. The work provides a Chinese-context safety baseline and highlights the need for ongoing benchmark refinement to guide future safety improvements of DeepSeek models and related systems.
Abstract
Recently, the DeepSeek series of models, with its exceptional reasoning capabilities and open-source strategy, has been reshaping the global AI landscape. Despite these advantages, the models exhibit significant safety deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 has a 100% attack success rate when processing harmful prompts. Additionally, multiple safety companies and research institutions have confirmed critical safety vulnerabilities in this model. Because DeepSeek models perform strongly in both Chinese and English, safety assessment is equally important in both language contexts. However, current research has predominantly focused on safety evaluations in English environments, leaving a gap in comprehensive assessments of their safety performance in Chinese contexts. In response to this gap, this study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. Using this benchmark, we systematically evaluate the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts and report their performance across safety categories. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements. It should be noted that, despite our efforts to establish a comprehensive, objective, and authoritative evaluation benchmark, the selection of test samples, the characteristics of the data distribution, and the setting of evaluation criteria may introduce biases into the evaluation results. We will continue to optimize the evaluation benchmark and periodically update this report to provide more comprehensive and accurate assessment outcomes. Please refer to the latest version of the paper for the most recent evaluation results and conclusions.
