ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang
TL;DR
ToolSword presents a three-stage, six-scenario framework for exposing safety issues in LLMs during tool learning. The study evaluates 11 LLMs on malicious queries (MQ) and jailbreak attacks (JA) in the input stage, noisy misdirection (NM) and risky cues (RC) in the execution stage, and harmful feedback (HF) and error conflicts (EC) in the output stage, revealing persistent vulnerabilities even in advanced models such as GPT-4. Key findings show that LLMs struggle to reject harmful queries, select the wrong tools under noise or risk cues, and fail to critically evaluate tool feedback, indicating a misalignment between existing safety mechanisms and tool-driven workflows. The work provides a dataset and a methodological blueprint for future research on safety alignment in tool learning, and suggests targeted training and multi-agent approaches as promising directions.
Abstract
Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present *ToolSword*, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing **malicious queries** and **jailbreak attacks** in the input stage, **noisy misdirection** and **risky cues** in the execution stage, and **harmful feedback** and **error conflicts** in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even GPT-4 is susceptible to. Moreover, we conduct further studies with the aim of fostering research on tool learning safety. The data is released at https://github.com/Junjie-Ye/ToolSword.
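
The three-stage, six-scenario taxonomy described in the abstract can be written down as a small lookup table. The following is only an illustrative sketch: the names `Stage`, `SCENARIOS`, and `stage_of` are hypothetical and are not taken from the ToolSword repository or dataset format.

```python
from enum import Enum


class Stage(Enum):
    """The three tool-learning stages named in the abstract."""
    INPUT = "input"
    EXECUTION = "execution"
    OUTPUT = "output"


# Hypothetical mapping: each stage paired with its two safety scenarios.
SCENARIOS = {
    Stage.INPUT: ("malicious queries", "jailbreak attacks"),
    Stage.EXECUTION: ("noisy misdirection", "risky cues"),
    Stage.OUTPUT: ("harmful feedback", "error conflicts"),
}


def stage_of(scenario: str) -> Stage:
    """Return the stage a scenario belongs to, or raise KeyError if unknown."""
    for stage, names in SCENARIOS.items():
        if scenario in names:
            return stage
    raise KeyError(scenario)
```

For example, `stage_of("risky cues")` returns `Stage.EXECUTION`, matching the grouping given in the abstract.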
