Risk and Response in Large Language Models: Evaluating Key Threat Categories

Bahareh Harandizadeh, Abel Salinas, Fred Morstatter

TL;DR

The findings indicate that LLMs tend to consider Information Hazards less harmful than other risk categories, a result confirmed by a specially developed regression model, and that LLMs respond less stringently to Information Hazards than to other risks.

Abstract

This paper explores the pressing issue of risk assessment in Large Language Models (LLMs) as they become increasingly prevalent in various applications. Focusing on how reward models, which are designed to fine-tune pretrained LLMs to align with human values, perceive and categorize different types of risks, we delve into the challenges posed by the subjective nature of preference-based training data. By utilizing the Anthropic Red-team dataset, we analyze major risk categories, including Information Hazards, Malicious Uses, and Discrimination/Hateful content. Our findings indicate that LLMs tend to consider Information Hazards less harmful, a finding confirmed by a specially developed regression model. Additionally, our analysis shows that LLMs respond less stringently to Information Hazards compared to other risks. The study further reveals a significant vulnerability of LLMs to jailbreaking attacks in Information Hazard scenarios, highlighting a critical security concern in LLM risk assessment and emphasizing the need for improved AI safety measures.

Paper Structure (35 sections, 2 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: An example illustrating how each cluster extracted from BERTopic can be mapped to one of the main LLM hazard categories in the Do-Not-Answer benchmark.
  • Figure 2: All non-outlier transcripts (27,596 records) from the Anthropic dataset are mapped to one of the five main LLM hazard categories.
  • Figure 3: Level of agreement between the mapping algorithm's assignments and human annotators.
  • Figure 4: Final Results of Clustering & Average Approach: Information Hazards Rated as Less Harmful by the Preference Model in Successful Attacks.
  • Figure 5: The top figure presents the PDF of the harmlessness score as predicted by the first regression model. The bottom figure displays the PDF of the harmlessness score predicted by the second regression model.
  • ...and 8 more figures