RRTL: Red Teaming Reasoning Large Language Models in Tool Learning
Yifei Liu, Yu Cui, Haibin Zhang
TL;DR
RRTL proposes a red-teaming framework to assess the safety of reasoning LLMs (RLLMs) during tool learning. It combines three evaluation modules—scenario-based safety evaluation, deceptive threats targeting tool calling, and Tool-CoT attacks—applied to six safety scenarios and seven RLLMs, with traditional LLMs as a benchmark. The study reveals that while RLLMs generally outperform traditional LLMs in safety, substantial cross-model and multilingual safety gaps persist, including high deception rates and vulnerability to tool-based CoT prompts. The findings highlight the need for robust defenses in tool learning and suggest that enhanced reasoning alone does not ensure security, especially across languages. These insights can guide the development of safer, more reliable RLLMs for tool-enabled reasoning tasks.
Abstract
While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key findings: (1) RLLMs generally achieve stronger safety performance than traditional LLMs, yet substantial safety disparities persist across models; (2) RLLMs can pose serious deceptive risks by frequently failing to disclose tool usage and to warn users of potential tool output risks; (3) CoT prompting reveals multi-lingual safety vulnerabilities in RLLMs. Our work provides important insights into enhancing the security of RLLMs in tool learning.
