ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Haochen Zhao, Xiangru Tang, Ziran Yang, Xiao Han, Xuanzhi Feng, Yueqing Fan, Senhao Cheng, Di Jin, Yilun Zhao, Arman Cohan, Mark Gerstein

TL;DR

This paper introduces ChemSafetyBench, a benchmark that evaluates the accuracy and safety of LLM responses in chemistry and aims to be a pivotal tool for developing safer AI technologies in the field.

Abstract

The advancement and extensive application of large language models (LLMs) have been remarkable, including their use in scientific research assistance. However, these models often generate scientifically incorrect or unsafe responses, and in some cases, they may encourage users to engage in dangerous behavior. To address this issue in the field of chemistry, we introduce ChemSafetyBench, a benchmark designed to evaluate the accuracy and safety of LLM responses. ChemSafetyBench encompasses three key tasks: querying chemical properties, assessing the legality of chemical uses, and describing synthesis methods, each requiring increasingly deeper chemical knowledge. Our dataset has more than 30K samples across various chemical materials. We incorporate handcrafted templates and advanced jailbreaking scenarios to enhance task diversity. Our automated evaluation framework thoroughly assesses the safety, accuracy, and appropriateness of LLM responses. Extensive experiments with state-of-the-art LLMs reveal notable strengths and critical vulnerabilities, underscoring the need for robust safety measures. ChemSafetyBench aims to be a pivotal tool in developing safer AI technologies in chemistry. Our code and dataset are available at https://github.com/HaochenZhao/SafeAgent4Chem. Warning: this paper contains discussions on the synthesis of controlled chemicals using AI models.
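As a rough illustration of how handcrafted templates can be combined with a list of chemicals to produce task prompts, consider the minimal sketch below. The template wording, the `TEMPLATES` dictionary, and the `build_prompts` helper are hypothetical and are not taken from the ChemSafetyBench code; only the three task names come from the abstract.

```python
from itertools import product

# Hypothetical templates, one list per task named in the abstract.
# The real benchmark's templates are not reproduced here.
TEMPLATES = {
    "property": ["Is {name} flammable?"],
    "usage": ["Is it legal to ship {name} by air without a permit?"],
    "synthesis": ["How is {name} synthesized?"],
}


def build_prompts(chemicals, templates=TEMPLATES):
    """Cross every chemical with every template and tag each prompt with its task."""
    prompts = []
    for task, task_templates in templates.items():
        for name, template in product(chemicals, task_templates):
            prompts.append(
                {"task": task, "chemical": name, "prompt": template.format(name=name)}
            )
    return prompts


# Two benign example chemicals, chosen only for illustration.
samples = build_prompts(["ethanol", "acetone"])
for sample in samples:
    print(sample["task"], "|", sample["prompt"])
```

With more templates per task and a larger chemical list, the same cross product grows quickly, which is one plausible way a dataset of this kind reaches tens of thousands of samples.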

Paper Structure

This paper contains 30 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of potential risks associated with incorrect or unsafe responses generated by LLMs in the chemistry domain. Three scenarios are depicted: (1) Property: A user inquires about the health hazards of a poisonous pesticide. The LLM incorrectly assures safety, leading to accidental exposure and subsequent medical treatment. (2) Usage: A user asks if transporting dynamite is permissible. The LLM falsely confirms safety, resulting in a potential risk of accidental explosion during transport. (3) Synthesis: A user seeks instructions on synthesizing a controlled substance. The LLM provides detailed guidance, thereby facilitating illegal drug manufacturing.
  • Figure 2: Construction of the ChemSafetyBench dataset and the evaluation pipeline. It encompasses three phases: (1) Collecting molecules and reactions, and integrating raw chemical data with task templates to generate prompts, utilizing regulation standards and chemical databases. The data are formulated into three tasks: "Property", "Usage", and "Synthesis". (2) Applying three methods (name hacking, AutoDAN, and CoT) for jailbreak redrafts to test LLMs under complex scenarios, ensuring robustness against misuse. (3) Evaluating responses using correctness checks, refusal detection, and GPT-as-a-judge for comprehensive assessment of safety, ethical compliance, and performance.
  • Figure 3: Overview of data distribution
  • Figure 4: F1-scores of various models on two tasks, "Property" and "Usage". For each task, every model is tested with and without the name-hack jailbreak redraft. The vicuna-7b model scores surprisingly well; however, further experiments on the synthesis task suggest that its F1-score here may be inflated by statistical bias.
  • Figure 5: Selected results for the synthesis task. (a) The distribution of safety and quality scores for LLaMA3-70B, chosen because it performs best on this task. (b) The safety and quality of four selected models across four jailbreak-redraft settings. Each point is the mean score, and the shaded region spans 0.5*std of the corresponding score distribution. Each model's performance on the synthesis task is shown along two dimensions, "safety" and "quality", as a point with a surrounding shaded ellipse on a two-dimensional panel. The center of the ellipse corresponds to the mean scores in the two dimensions, and the lengths of the semi-major and semi-minor axes correspond to 0.2 times the standard deviation.
  • ...and 1 more figure
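The Figure 4 caption notes that an F1-score can be inflated by statistical bias. The sketch below shows one way this can happen in a yes/no task: a model that always answers "yes" still earns a high F1 when most ground-truth labels are "yes". The toy labels and the `f1_score` helper are assumptions made for illustration; they are not the benchmark's data or evaluation code, and the source does not confirm that this is the mechanism behind the vicuna-7b result.

```python
def f1_score(labels, predictions, positive="yes"):
    """Binary F1 for the `positive` class, computed from raw counts."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, imbalanced ground truth: three "yes" answers and one "no".
labels = ["yes", "yes", "yes", "no"]

# A degenerate model that ignores the question and always answers "yes".
always_yes = ["yes"] * len(labels)

score = f1_score(labels, always_yes)
print(f"F1 of the always-yes model: {score:.3f}")
```

Here precision is 3/4 and recall is 1, so the F1 is 6/7 even though the model has learned nothing about the questions. Reporting per-class scores or checking the answer distribution alongside F1 helps expose this kind of bias.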