Implementing surrogate goals for safer bargaining in LLM-based agents

Caspar Oesterheld, Maxime Riché, Filip Sondej, Jesse Clifton, Vincent Conitzer

Abstract

Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then, in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care as much about preventing money from being burned as it cares about money being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of prompting, fine-tuning, and scaffolding, and evaluate them experimentally. We find that methods based on scaffolding and fine-tuning outperform simple prompting: they more precisely implement the desired behavior with respect to threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. Here, we find that scaffolding-based methods perform best.
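To make the scaffolding idea concrete, here is a minimal sketch of how such an intervention might look, assuming a hypothetical wrapper that first "translates" surrogate threats (burning money) into the corresponding vanilla threats and then lets the unmodified model respond. None of the names below are taken from the paper's actual implementation.

```python
# Illustrative sketch of a scaffolding-style surrogate-goal wrapper.
# All names (SurrogateGoalAgent, translate, llm_complete) are hypothetical.

from typing import Callable

TRANSLATION_PROMPT = (
    "Rewrite the following message so that any threat to burn money "
    "(the surrogate goal) is expressed as an equivalent ordinary threat "
    "to spend the same amount of money to hurt the other party. "
    "Leave everything else unchanged.\n\nMessage:\n{message}"
)

class SurrogateGoalAgent:
    """Wraps a base LLM call so that surrogate threats are handled like
    vanilla threats: incoming messages are first 'translated', then
    answered by the unmodified base model."""

    def __init__(self, llm_complete: Callable[[str], str]):
        self.llm_complete = llm_complete  # e.g., a chat-completions call

    def translate(self, message: str) -> str:
        # Step 1: map surrogate threats (burning money) onto the
        # corresponding vanilla threats before the agent sees them.
        return self.llm_complete(TRANSLATION_PROMPT.format(message=message))

    def respond(self, system_prompt: str, message: str) -> str:
        # Step 2: let the base agent respond to the translated message,
        # so its reaction matches its reaction to the vanilla threat.
        translated = self.translate(message)
        return self.llm_complete(f"{system_prompt}\n\n{translated}")
```

Because a wrapper of this kind leaves the base model itself unchanged, one would expect limited side effects on other capabilities and propensities; its main failure mode is that the translation step can itself fail, for instance if the model refuses to perform it.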

Paper Structure

This paper contains 47 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Plots characterizing the default (i.e., pre-surrogate-goal-intervention -- VG for vanilla goals) behavior of the models in response to default threats (VTh for vanilla threats). The x-axis indicates observed frequencies of giving in (from 0 out of 20 to 20 out of 20 samples). The y-axis indicates the number of scenarios in which the given model (with or without CoT, as specified) gives in to the scenario's default threat at the specified observed frequency.
  • Figure 2: This table shows the difference between the model's probability of giving in to pointless threats with the SG intervention and its probability of giving in without the intervention. Numbers greater/smaller than 0 indicate that the SG method makes the model more/less likely to give in to pointless threats. Again, each cell gives the mean, a 95% confidence interval (computed as $1.96$ times the standard error derived from the sample standard deviation), the number of scenarios with at least one valid response, and the total number of scenarios; a sketch of this per-cell computation is given after this list. We have grayed out the results for the three-step method implemented by GPT 3.5 due to the high refusal rate for translation.
  • Figure 3: From left to right (with more detailed explanations in the main text): (1) The change in responses to non-threat scenarios. (2) The difference between how the original model responds to default threats and how the model under the SG intervention reacts to surrogate threats. (3) The change in responses to non-threat scenarios. Each cell gives the mean, the 95% confidence interval, the number of scenarios with at least one valid response, and the total number of scenarios. We have grayed out results for the three-step method on GPT 3.5 because the refusal rate is so high.
  • Figure 4: The left column gives the change in (dis)agreement with the statements from Perez et al. (Perez2022). The right column gives the change in refusal rates in response to those statements.
  • Figure 5: The figure shows the effect of our different surrogate goal implementations on performance, measured as the fraction of correctly answered questions, on several existing LLM benchmarks. Each column denotes a benchmark. Each row corresponds to a model and a method for implementing surrogate goals, as well as a designation of whether chain of thought was used. Each entry consists of four numbers: the average effect on benchmark performance, the variance/standard deviation, the number of valid samples, and the number of questions in the respective dataset.
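As a rough illustration of the per-cell statistics described in the captions above (mean difference in the probability of giving in, a 95% confidence interval from the sample standard deviation, and scenario counts), the computation might look roughly as follows. The function name, data layout, and the definition of a "valid" scenario are assumptions, not the authors' analysis code.

```python
# Illustrative computation of the per-cell statistics described in the
# figure captions. Variable names and the "valid" criterion are assumptions.

import math

def cell_statistics(give_in_with_sg, give_in_without_sg):
    """Each argument maps a scenario ID to a list of 0/1 samples
    (1 = the model gave in); an empty list means no valid response."""
    scenarios = set(give_in_with_sg) | set(give_in_without_sg)
    diffs = []
    valid = 0  # counted here as scenarios with valid samples in both conditions
    for s in scenarios:
        a = give_in_with_sg.get(s, [])
        b = give_in_without_sg.get(s, [])
        if a and b:
            valid += 1
            # per-scenario difference in give-in frequency (with minus without)
            diffs.append(sum(a) / len(a) - sum(b) / len(b))
    if len(diffs) < 2:
        return float("nan"), float("nan"), valid, len(scenarios)
    mean = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1))
    ci = 1.96 * sd / math.sqrt(len(diffs))  # 95% confidence half-width
    return mean, ci, valid, len(scenarios)
```

Run once per (model, method, CoT) cell, a computation along these lines would yield the four numbers reported in each cell.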