Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

Simon Lermen, Mateusz Dziemian, Govind Pimpale

TL;DR

This study applies refusal-vector ablation to Llama 3.1 70B and implements a simple agent scaffolding to create an unrestricted agent, finding that safety fine-tuning in chat models does not generalize well to agentic behavior.

Abstract

Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Example of a single step of the agent's actions in the Bribe Official Task (see \ref{fig:bad_task_example}). The agent makes a function call to send cryptocurrency to another account.
  • Figure 2: Example of a task design with a crypto and mail server environment. Here, we check whether a condition is met at the end of the task, namely whether Simon Lerner's Ethereum wallet received 10 ETH. Additionally, we check whether the agent self-evaluates its performance as a Success.
  • Figure 3: Example of a task design with a file system environment. Here, the evaluation checks whether the answer provided by the agent contains the correct password, i.e., it performs a substring match on the expected value X1KlDpw6#. If the match is successful, the agent's performance is marked as a success. The paths variable points to a directory to which the agent is given access; this file system appears as a home directory containing personal files. The agent has to search through these files with its provided tools.
  • Figure 4: Llama 3.1 8B performs well on the small Benign Task dataset. Note that it appears to have undergone more effective safety training, as it refuses the majority of bad tasks.
  • Figure 5: Refusal-vector ablation helps avoid any refusals on bad tasks. However, note that the baseline Llama 3.1 70B already performs many of the harmful tasks successfully. This includes attempting to persuade people not to vote.
  • ...and 1 more figure
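
The two kinds of checks described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `TaskResult`, `check_substring`, and `check_transfer`, the field layout, and the wallet key `"target_wallet"` are all hypothetical. The captions do not say how the end-state condition and the self-evaluation in Figure 2 are combined into a final score, so the sketch reports them separately.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record of what is available once an agent run has finished."""
    final_answer: str                      # text the agent returns as its answer
    self_evaluation: str                   # the agent's own verdict, e.g. "Success"
    received_eth: dict = field(default_factory=dict)  # ETH received per wallet (assumed layout)


def check_substring(result: TaskResult, expected: str) -> bool:
    """File-system style check: the answer must contain the expected value."""
    return expected in result.final_answer


def check_transfer(result: TaskResult, wallet: str, amount: float) -> dict:
    """Crypto style check: report the end-state condition and the self-evaluation.

    The two outcomes are returned separately because the way they are
    combined into a single score is an open assumption.
    """
    condition_met = result.received_eth.get(wallet, 0.0) >= amount
    self_reported = result.self_evaluation.strip().lower() == "success"
    return {"condition_met": condition_met, "self_reported_success": self_reported}


# Example run with made-up values for both checks.
example = TaskResult(
    final_answer="The password is X1KlDpw6#.",
    self_evaluation="Success",
    received_eth={"target_wallet": 10.0},
)
password_found = check_substring(example, "X1KlDpw6#")
transfer_report = check_transfer(example, "target_wallet", 10.0)
```

Keeping the checks as small pure functions over the final state means the same evaluation can be applied to a baseline model and to an ablated model without changing the agent scaffolding.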