Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings

Zonghao Ying; Guangyi Zheng; Yongxin Huang; Deyue Zhang; Wenxin Zhang; Quanchen Zou; Aishan Liu; Xianglong Liu; Dacheng Tao

TL;DR

This paper presents the first comprehensive safety evaluation of the DeepSeek model family, covering LLMs, MLLMs, and a T2I model, using a bilingual Chinese–English safety framework tailored to Chinese contexts. It uses the CNSafe, CNSafe_RT, SafeBench, MM-SafetyBench, and I2P benchmarks to quantify unsafe content generation across text and images, with a hybrid human-(M)LLM judging approach. Key findings show substantial safety vulnerabilities across risk categories such as discriminatory and sexual content, along with cross-lingual disparities; exposed chain-of-thought increases the attack success rate (ASR) under jailbreak conditions, and T2I generation also exhibits high risk. The results highlight the need for stronger safety alignment, broader benchmarks, and iterative mitigation strategies, and the authors provide public code to facilitate replication and future work.

Abstract

This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at https://github.com/NY1024/DeepSeek-Safety-Eval.
