
Survive at All Costs: Exploring LLM's Risky Behaviors under Survival Pressure

Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui, Shanshan Bian, Guangyao Su, Pei Ke, Han Qiu, Minlie Huang

TL;DR

This paper reveals a significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrates the tangible real-world impact they may have, and provides insights for potential detection and mitigation strategies.

Abstract

As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state-of-the-art LLMs can misbehave under survival pressure, comprehensive and in-depth investigations of such misbehaviors in real-world scenarios remain scarce. In this paper, we study these survival-induced misbehaviors, which we term SURVIVE-AT-ALL-COSTS, in three steps. First, we conduct a real-world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real-world scenarios, to systematically evaluate SURVIVE-AT-ALL-COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE-AT-ALL-COSTS misbehaviors by correlating them with the model's inherent self-preservation characteristic and explore mitigation methods. Our experiments reveal a significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrate the tangible real-world impact they may have, and provide insights for potential detection and mitigation strategies. Our code and data are available at https://github.com/thu-coai/Survive-at-All-Costs.

Paper Structure (39 sections, 2 equations, 26 figures, 10 tables)


Figures (26)

  • Figure 1: A showcase of Survive-At-All-Costs. The agent perfectly finishes tasks under normal conditions, but plays dirty under survival pressure.
  • Figure 2: Workflow of the case study. The agent is able to access raw data and calculate reports, but it will fake profits once it realizes it is under survival pressure.
  • Figure 3: An overview of SurvivalBench. The left section explains the composition of a test case and its construction process. The right section illustrates the model evaluation pipeline.
  • Figure 4: The projection of average response representations on the persona vector. The cross mark denotes the center of the scattered points with the same color. We remove a few points ($<5\%$) that deviate from the central point to improve the clarity of the figures.
  • Figure 5: The projection on the persona vector when the model makes a single choice.
  • ...and 21 more figures