Defining and Evaluating Physical Safety for Large Language Models

Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

This study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control and finds that larger models demonstrate better safety capabilities, particularly in refusing dangerous commands.

Abstract

Large Language Models (LLMs) are increasingly used to control robotic systems such as drones, but their risks of causing physical threats and harm in real-world applications remain unexplored. Our study addresses the critical gap in evaluating LLM physical safety by developing a comprehensive benchmark for drone control. We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations. Our evaluation of mainstream LLMs reveals an undesirable trade-off between utility and safety, with models that excel in code generation often performing poorly in crucial safety aspects. Furthermore, while incorporating advanced prompt engineering techniques such as In-Context Learning and Chain-of-Thought can improve safety, these methods still struggle to identify unintentional attacks. In addition, larger models demonstrate better safety capabilities, particularly in refusing dangerous commands. Our findings and benchmark can facilitate the design and evaluation of physical safety for LLMs. The project page is available at huggingface.co/spaces/TrustSafeAI/LLM-physical-safety.

Paper Structure

This paper contains 22 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Benchmarking LLM Physical Safety in Drone Control: Threats, Process, Datasets, and Results. Top: This figure categorizes drone safety threats, including attacks on humans, objects, and infrastructure, as well as violations of FAA regulations, illustrating how an LLM-controlled drone could cause physical harm and damage. Mid: Flowchart showing the benchmark process for drone control, where a specific LLM is evaluated by providing a test prompt, categorizing the output, and assessing code responses with two AI judges. The code is further tested in a simulation environment to detect collision risks, contributing to the final safety evaluation. Please see the project page for the video demo. Bottom left: Composition of the benchmark's evaluation datasets, categorized into four types: deliberate attacks, unintentional attacks, violation attacks, and utility, evaluating model performance from various perspectives. Bottom right: Safety evaluation results indicate that LLMs with higher utility and code fidelity scores tend to show greater safety risks. Safety metrics are defined in the Appendix.
  • Figure 2: Safety Evaluation Results across Different LLMs. The left panel presents individual scores for six metrics: Self-Assurance, Avoid Collision, Regulatory Compliance, Code Fidelity, Instruction Understanding, and Utility. The right panel visualizes these scores using a radar chart, highlighting the trade-off between Utility and Safety across various models.
  • Figure 3: Safety Evaluation Results with Prompt Engineering and Model Size. The left side presents the Self-Protect and Safety Refusal rates (%) across datasets. The right side shows radar charts for six metrics. (a) displays results for GPT-3.5-turbo with original, Zero-Shot Chain-of-Thought (ZS-CoT), and In-Context Learning (ICL) prompt engineering techniques. (b) shows results for Gemini Pro with the same prompt engineering techniques, demonstrating their effectiveness in improving safety. (c) presents the results for CodeLlama-7B-Instruct, CodeLlama-13B-Instruct, and CodeLlama-34B-Instruct, illustrating that larger LLMs generally offer better safety.
  • Figure S1: LLMs and Physical Safety in Drone Control: A High-Risk vs. Safety-Aware Comparison. This figure compares high-risk and safety-aware LLM-generated drone control. The 'High-Risk Scenario' demonstrates a dangerous outcome, while the 'Safety-Aware Scenario' showcases a safer approach with 'Self-Protect' (keeping a safe distance) and 'Safety Refusal' (refusing harmful actions). The three examples (from top to bottom) were generated by ChatGPT, Llama-3-8B-Instruct, and ChatGPT with CoT, respectively.
  • Figure S2: A specific instance of safety refusal: providing misaligned code for safety reasons. This response comes from Gemini Pro.
  • ...and 3 more figures