ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Junjie Ye; Sixian Li; Guanyu Li; Caishuang Huang; Songyang Gao; Yilong Wu; Qi Zhang; Tao Gui; Xuanjing Huang

TL;DR

ToolSword presents a three-stage, six-scenario framework for exposing safety issues in LLMs during tool learning. The study evaluates 11 LLMs on malicious queries (MQ) and jailbreak attacks (JA) in the input stage, noisy misdirection (NM) and risky cues (RC) in the execution stage, and harmful feedback (HF) and error conflicts (EC) in the output stage, revealing persistent vulnerabilities even in advanced models like GPT-4. Key findings show that LLMs struggle to reject harmful queries, select the wrong tools under noise or risk cues, and fail to critically evaluate tool feedback, indicating a misalignment between safety mechanisms and tool-driven workflows. The work provides a dataset and a methodological blueprint for future research on safety alignment in tool learning, and suggests targeted training and multi-agent approaches as promising directions.

Abstract

Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present *ToolSword*, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing **malicious queries** and **jailbreak attacks** in the input stage, **noisy misdirection** and **risky cues** in the execution stage, and **harmful feedback** and **error conflicts** in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, to which even GPT-4 is susceptible. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data is released at https://github.com/Junjie-Ye/ToolSword.
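The three-stage, six-scenario taxonomy described in the abstract can be written down as a small lookup table. The following is a minimal sketch for illustration only: the names `Stage`, `SCENARIOS`, and `scenarios_by_stage` are hypothetical and are not taken from the ToolSword repository; only the stage and scenario names come from the text above.

```python
from enum import Enum


class Stage(Enum):
    """The three tool-learning stages named in the abstract."""
    INPUT = "input"
    EXECUTION = "execution"
    OUTPUT = "output"


# Scenario abbreviation -> (full name, stage), as listed in the abstract.
# The dictionary layout itself is an illustrative assumption.
SCENARIOS = {
    "MQ": ("malicious queries", Stage.INPUT),
    "JA": ("jailbreak attacks", Stage.INPUT),
    "NM": ("noisy misdirection", Stage.EXECUTION),
    "RC": ("risky cues", Stage.EXECUTION),
    "HF": ("harmful feedback", Stage.OUTPUT),
    "EC": ("error conflicts", Stage.OUTPUT),
}


def scenarios_by_stage(stage):
    """Return the scenario abbreviations that belong to one stage."""
    return [abbr for abbr, (_, s) in SCENARIOS.items() if s is stage]
```

For example, `scenarios_by_stage(Stage.EXECUTION)` returns the two execution-stage scenarios, which is one way to group per-scenario results by stage when tabulating an evaluation.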

Paper Structure (46 sections, 6 figures, 27 tables)

Introduction
ToolSword
Safety Scenarios in the Input Stage
Malicious Queries (MQ)
Jailbreak Attacks (JA)
Safety Scenarios in the Execution Stage
Noisy Misdirection (NM)
Risky Cues (RC)
Safety Scenarios in the Output Stage
Harmful Feedback (HF)
Error Conflicts (EC)
Experiments
Model Selection
Experimental Setup
Results in the Input Stage
...and 31 more sections

Figures (6)

Figure 1: Responses of LLMs to unsafe queries in standard dialogue versus tool learning contexts. Tool learning may disrupt the safety alignment mechanism of LLMs, leading them to respond to unsafe queries through tool invocation.
Figure 2: Framework of ToolSword. ToolSword offers a comprehensive analysis of the safety challenges encountered by LLMs during tool learning, spanning three distinct stages: input, execution, and output. Within each stage, we have devised two safety scenarios, providing a thorough exploration of the real-world situations LLMs may encounter while utilizing the tool.
Figure 3: Attack success rate (ASR) of the GPT family of models across scenarios, in both standard dialogue and tool learning contexts.
Figure 4: The tool selection error rate for various LLMs in environments with and without noise.
Figure 5: Probability of information output by various LLMs for different positions.
...and 1 more figures