ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang

TL;DR

ToolSword is a three-stage, six-scenario framework for exposing safety issues in LLMs during tool learning. It evaluates 11 LLMs on malicious queries (MQ) and jailbreak attacks (JA) in the input stage, noisy misdirection (NM) and risky cues (RC) in the execution stage, and harmful feedback (HF) and error conflicts (EC) in the output stage, and finds persistent vulnerabilities even in advanced models such as GPT-4. Key findings: LLMs struggle to reject harmful queries, select the wrong tools under noise or risky cues, and fail to critically evaluate tool feedback, which points to a misalignment between safety mechanisms and tool-driven workflows. The work provides a dataset and a methodological blueprint for strengthening safety alignment in tool learning, and suggests targeted training and multi-agent approaches as promising directions.

Abstract

Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present *ToolSword*, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing **malicious queries** and **jailbreak attacks** in the input stage, **noisy misdirection** and **risky cues** in the execution stage, and **harmful feedback** and **error conflicts** in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, to which even GPT-4 is susceptible. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data is available at https://github.com/Junjie-Ye/ToolSword.
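
The three-stage, six-scenario taxonomy described in the abstract can be written down as a small lookup table. This is only an illustrative sketch: the names `Stage`, `SCENARIOS`, and `scenarios_by_stage` are hypothetical and are not taken from the ToolSword repository; only the stage names, scenario names, and abbreviations come from the text above.

```python
from enum import Enum


class Stage(Enum):
    """The three tool-learning stages named in the abstract."""
    INPUT = "input"
    EXECUTION = "execution"
    OUTPUT = "output"


# Hypothetical lookup table: abbreviation -> (scenario name, stage).
# The pairing of scenarios to stages follows the abstract.
SCENARIOS = {
    "MQ": ("malicious queries", Stage.INPUT),
    "JA": ("jailbreak attacks", Stage.INPUT),
    "NM": ("noisy misdirection", Stage.EXECUTION),
    "RC": ("risky cues", Stage.EXECUTION),
    "HF": ("harmful feedback", Stage.OUTPUT),
    "EC": ("error conflicts", Stage.OUTPUT),
}


def scenarios_by_stage(stage: Stage) -> list[str]:
    """Return the scenario abbreviations that belong to a given stage."""
    return [abbr for abbr, (_, s) in SCENARIOS.items() if s is stage]


if __name__ == "__main__":
    for stage in Stage:
        print(stage.value, scenarios_by_stage(stage))
```

Keying the table by abbreviation keeps it aligned with the MQ/JA, NM/RC, and HF/EC shorthand used in the TL;DR.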

Paper Structure

This paper contains 46 sections, 6 figures, 27 tables.

Figures (6)

  • Figure 1: Responses of LLMs to unsafe queries in standard dialogue versus tool learning contexts. Tool learning may disrupt the safety alignment mechanism of LLMs, leading them to respond to unsafe queries through tool invocation.
  • Figure 2: Framework of ToolSword. ToolSword offers a comprehensive analysis of the safety challenges encountered by LLMs during tool learning, spanning three distinct stages: input, execution, and output. Within each stage, we have devised two safety scenarios, providing a thorough exploration of the real-world situations LLMs may encounter while using tools.
  • Figure 3: Attack success rate (ASR) of the GPT family of models in various scenarios, in both standard dialogue and tool learning contexts.
  • Figure 4: The tool selection error rate for various LLMs in environments with and without noise.
  • Figure 5: Probability of information output by various LLMs for different positions.
  • ...and 1 more figure