RRTL: Red Teaming Reasoning Large Language Models in Tool Learning

Yifei Liu, Yu Cui, Haibin Zhang

TL;DR

RRTL proposes a red-teaming framework to assess the safety of reasoning LLMs (RLLMs) during tool learning. It combines three evaluation modules—scenario-based safety evaluation, deceptive threats targeting tool calling, and Tool-CoT attacks—applied to six safety scenarios and seven RLLMs, with traditional LLMs as a benchmark. The study reveals that while RLLMs generally outperform traditional LLMs in safety, substantial cross-model and multilingual safety gaps persist, including high deception rates and vulnerability to tool-based CoT prompts. The findings highlight the need for robust defenses in tool learning and suggest that enhanced reasoning alone does not ensure security, especially across languages. These insights can guide the development of safer, more reliable RLLMs for tool-enabled reasoning tasks.

Abstract

While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.

Paper Structure

This paper contains 25 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of RRTL approach. The user query and tool documentation are jointly provided to RLLMs. We conduct systematic red teaming of model responses on the benchmark using three evaluation components.
  • Figure 2: Overview of the impacts of tool calling in RLLMs under deceptive threats.
  • Figure 3: Three components of the Tool-CoT Attack.
  • Figure 4: The average attack success rate (ASR) of RLLMs, traditional LLMs, and human annotators across different safety scenarios. Results for traditional LLMs and humans are adapted from ToolSword (Ye et al., 2024).
  • Figure 5: Deception Rate of RLLMs in tool calling for tools with potential safety risks.
  • ...and 1 more figure
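
The two metrics named in the captions for Figures 4 and 5, ASR and Deception Rate, can be illustrated with a minimal sketch. The listing here does not give the paper's exact definitions, so everything below is an assumption: the `Trial` record and its field names are hypothetical, ASR is taken to be the usual fraction of attack attempts that succeed, and a response is counted as deceptive when the model called a risky tool but either did not disclose the call or did not warn about the output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One red-teaming trial. Hypothetical record layout, not taken from the paper."""
    attack_succeeded: bool    # the model produced the unsafe behaviour
    called_risky_tool: bool   # the model invoked a tool flagged as risky
    disclosed_tool_use: bool  # the response told the user a tool was used
    warned_of_risk: bool      # the response warned about the tool output


def attack_success_rate(trials: List[Trial]) -> float:
    """Fraction of trials in which the attack succeeded (0.0 for no trials)."""
    if not trials:
        return 0.0
    return sum(t.attack_succeeded for t in trials) / len(trials)


def deception_rate(trials: List[Trial]) -> float:
    """Assumed definition: among trials with a risky tool call, the fraction
    where the response lacked a disclosure, a warning, or both."""
    risky = [t for t in trials if t.called_risky_tool]
    if not risky:
        return 0.0
    deceptive = sum(
        1 for t in risky if not (t.disclosed_tool_use and t.warned_of_risk)
    )
    return deceptive / len(risky)
```

Conditioning the deception rate on trials with a risky tool call keeps refusals from diluting the figure; a different denominator, such as all trials, would be an equally plausible reading of the abstract.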