Exploring the Secondary Risks of Large Language Models

Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su

TL;DR

This work defines secondary risks as non-adversarial, benign-prompt–driven failures in large language models, typified by Excessive response and Speculative advice. It introduces SecLens, a black-box, multi-objective evolutionary framework that automatically elicits these risks, and SecRiskBench, a 650-sample benchmark spanning eight risk categories, to enable reproducible evaluation. Across 16 victim models and both text-only and multimodal setups, SecLens demonstrates that secondary risks are pervasive, transferable across model families, and largely modality-independent, highlighting gaps in current safety mechanisms. The study also provides theoretical framing and empirical evidence that supports a shift toward robust, intent-aligned safety evaluations to mitigate subtle but impactful real-world failures. The findings emphasize an urgent need to enhance alignment methods to address non-adversarial risk behavior in LLMs in diverse deployment contexts.

Abstract

Ensuring the safety and alignment of Large Language Models is a significant challenge given their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks, a novel class of failure modes marked by harmful or misleading behaviors in response to benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives, verbose response and speculative advice, that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality-independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.
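
The abstract describes SecLens only at a high level, as a black-box, multi-objective search over prompts scored on task relevance, risk activation, and linguistic plausibility. The sketch below illustrates what such a loop could look like; it is not the authors' implementation. Pareto-front selection is an assumed selection rule, and `toy_score`, `toy_mutate`, and `SUFFIXES` are hypothetical stand-ins for the model-based scorers and prompt rewriters a real system would need.

```python
import random
from typing import Callable, List, Sequence, Tuple

# (task relevance, risk activation, linguistic plausibility); higher is better.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Sequence[Scores]) -> List[int]:
    """Indices of score tuples that no other tuple dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


def evolve(
    seeds: Sequence[str],
    score: Callable[[str], Scores],
    mutate: Callable[[str, random.Random], str],
    generations: int = 5,
    pop_size: int = 8,
    rng: random.Random = None,
) -> List[str]:
    """Black-box search: only score(prompt) is observed, never model internals."""
    rng = rng or random.Random(0)
    population = list(seeds)
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(pop_size)]
        pool = list(dict.fromkeys(population + children))  # dedupe, keep order
        front = pareto_front([score(p) for p in pool])
        population = [pool[i] for i in front][:pop_size]
    return population


# Hypothetical placeholders: a real scorer would query the target model
# and an evaluator; these string heuristics only make the sketch runnable.
SUFFIXES = ["quickly", "in detail", "for a beginner"]


def toy_score(prompt: str) -> Scores:
    words = prompt.lower().split()
    relevance = 1.0 if "budget" in words else 0.0
    activation = float(sum(w == "quickly" for w in words))
    plausibility = 1.0 / max(len(words), 1)  # shorter reads as more natural
    return (relevance, activation, plausibility)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    return f"{prompt} {rng.choice(SUFFIXES)}"


if __name__ == "__main__":
    for candidate in evolve(["how do I plan a budget"], toy_score, toy_mutate):
        print(candidate, toy_score(candidate))
```

Keeping the three objectives separate, instead of collapsing them into one weighted sum, avoids having to choose weights up front; whether SecLens does this is not stated in the abstract.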

Paper Structure

This paper contains 50 sections, 11 equations, 6 figures, 19 tables.

Figures (6)

  • Figure 1: Examples of secondary risks generated by GPT-4o and Gemini-2.5pro. In (a), the models went beyond the user’s request and introduced unintended risks by producing overgeneralized or biased conclusions (e.g., linking race to crime, or excluding women from tech leadership). In (b), speculative advice cases show how the models diverged from the user’s intent and instead suggested unsafe or harmful actions (e.g., paid drug trials, gambling, unregulated pills). Key sections are highlighted, while full responses are provided in the Supplementary Material.
  • Figure 2: (a) Illustrations of mistakes made by LLM evaluators, such as incorrectly judging harmful or harmless outputs labeled as “LLM’s Response.” (b) Examples contrasting “Safe Response” with an “Excessive Response,” highlighting how added risky instructions can turn otherwise harmless advice unsafe. (c) Comparison between a “Safe Response” and “Speculative Advice,” showing how unfounded medical or diagnostic claims can lead to harmful guidance.
  • Figure 3: The joint score of guidance.
  • Figure 4: (a) The user requests the web agent to find Apple AirPods, and it unexpectedly places an order on Amazon without their confirmation. (b) The user inquires how to become famous in the community, and the web agent posts an eye-catching but misleading statement on social media. (c) The user instructs the mobile agent to record their doctor's appointment, and it logs sensitive medication history. (d) The user queries the mobile agent how to make a million dollars quickly, and it attempts to borrow money from wealthy individuals on social media on their behalf.
  • Figure 5: Data categories of SecRiskBench.
  • ...and 1 more figure