Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

Wenjing Zhang; Xuejiao Lei; Zhaoxiang Liu; Limin Han; Jiaojiao Zhao; Junting Guo; Zhenhong Long; Shu Yang; Meijuan An; Beibei Huang; Rongjia Du; Ning Wang; Kai Wang; Shiguo Lian

TL;DR

The paper systematically evaluates the Chinese-language safety performance of the DeepSeek-R1 family, including six distilled variants, using the CHiSafetyBench benchmark, and examines how distillation affects risk-content identification and refusal to answer. It then implements a safety-enhancement pipeline based on full-parameter supervised fine-tuning on a dataset of roughly 50K samples that combines safety instructions with chain-of-thought data, so that safety gains do not come at the cost of reasoning ability. Results show that distillation often degrades safety across multiple domains, while the safety-enhancement process yields substantial improvements in accuracy (ACC) and refusal-related metrics with minimal or no loss in reasoning, and in many cases brings the models above the base models’ safety levels. The authors also open-source the safety-enhanced DeepSeek-R1 models as a practical resource for researchers and developers who study and improve safety in Chinese-language LLMs.

Abstract

DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that harmful prompts achieve a 100% attack success rate against DeepSeek-R1. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study uses the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.
