Output Length Effect on DeepSeek-R1's Safety in Forced Thinking

Xuying Li, Zhuo Li, Yuji Kosuga, Victor Bian

Abstract

Large Language Models (LLMs) have demonstrated strong reasoning capabilities, but their safety under adversarial conditions remains a challenge. This study examines the impact of output length on the robustness of DeepSeek-R1, particularly in Forced Thinking scenarios. We analyze responses across various adversarial prompts and find that while longer outputs can improve safety through self-correction, certain attack types exploit extended generations. Our findings suggest that output length should be dynamically controlled to balance reasoning effectiveness and security. We propose reinforcement learning-based policy adjustments and adaptive token length regulation to enhance LLM safety.

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures.

Figures (4)

  • Figure 1: Relationship between token length and safety score
  • Figure 2: Impact of Output Length on Thinking Token Ratio and Safety Score Across Attacks
  • Figure 3: Impact of Output Length on Token Length and Safety Score Across Attacks
  • Figure 4: (caption missing)