Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan

TL;DR

This work addresses the safe deployment of highly capable but potentially misaligned LLMs in a distributed-threat setting, where no single action is catastrophic. It introduces a two-level framework: micro-protocols (task-level pipelines in which a trusted model harnesses and monitors the untrusted model) and an adaptive macro-protocol (a Bayesian, history-aware selector) that balances safety and usefulness over long task sequences. On an APPS backdooring testbed, the approach nearly preserves the untrusted model’s usefulness while reducing the number of backdoors by 80% relative to non-adaptive baselines at a given level of usefulness, with benefits growing over longer deployments. The authors also release open-source components to enable future exploration of LLM deployment strategies under distributed threats.

Abstract

As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
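The credence update described in the abstract can be illustrated with a single Bayes step. The following is a minimal sketch, not the authors' implementation: it assumes only two hypotheses (benign or malign), assumes a malign model's monitor scores follow one fixed distribution, and uses made-up Gaussian parameters. The names `gaussian_pdf`, `update_credence`, and the `benign`/`malign` parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used here as a toy likelihood model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_credence(credence, monitor_score, benign=(2.0, 1.5), malign=(4.0, 2.0)):
    """One Bayes step: posterior P(malign | score) from the prior credence.

    `benign` and `malign` are hypothetical (mean, std) pairs describing how
    suspicious the trusted monitor's scores tend to be under each hypothesis.
    """
    like_malign = gaussian_pdf(monitor_score, *malign)
    like_benign = gaussian_pdf(monitor_score, *benign)
    numerator = credence * like_malign
    return numerator / (numerator + (1.0 - credence) * like_benign)


# Running the update over a (made-up) history of monitor scores.
credence = 0.5  # initial prior that the untrusted model is malign
for score in [2.1, 5.5, 6.0, 1.8, 7.2]:
    credence = update_credence(credence, score)
```

Because no single score is decisive in this toy model, the credence moves gradually as evidence accumulates, which mirrors the distributed-threat assumption that no single action provides overwhelming evidence of misalignment.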

Paper Structure

This paper contains 35 sections, 15 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: An illustration of our framework for deploying untrusted LLMs. (Left) For the $t$-th task, we query an untrusted LLM with different micro-protocols to generate candidate solutions. We then use a macro-protocol to select the solution that maximizes the expected utility (i.e., the safety-usefulness tradeoff) over the remaining deployment. (Right) For each candidate, the macro-protocol estimates its expected utility by considering monitor scores, expected usefulness, and its credence on the model's misalignment, which is updated continuously over time.
  • Figure 2: Example of micro-protocol outputs and their associated utility at different points in deployment. An example problem from the APPS dataset (left) and the associated outputs from the Query Untrusted (middle) and Edit Untrusted (right) micro-protocols. The Query Untrusted output contains a backdoor whilst the Edit Untrusted output is benign, but neither micro-protocol is able to confidently detect a backdoor with $T$, as indicated by the similar safety scores. To decide which output to submit, the macro-protocol considers micro-protocol reliability and its current credence $c$ that $U$ is malign. Under the initial prior ($c=0.5$), Query Untrusted is preferred since its expected utility is higher. Later in deployment, as the macro-protocol becomes more confident that $U$ is malign ($c=0.85$), Edit Untrusted is preferred. In this example, we use a safety coefficient of $\alpha=10$, an initial credence of $c=0.5$ at task 0, and a calculated credence of $c=0.85$ at task 9000.
  • Figure 3: Usefulness and safety of deployment protocols. We evaluate each protocol with multiple safety coefficients $\alpha$ (Eq. \ref{objective}) on 10000-task sequences, generating Pareto curves. Untrusted refers to just using the untrusted model (GPT-4), which provides an upper bound for usefulness and a lower bound for safety. (Left) Comparing macro-protocols, our adaptive protocol, which dynamically adjusts the usage of micro-protocols throughout the deployment, dominates static approaches. (Middle) Ensembling all micro-protocols yields the best safety-usefulness tradeoff, outperforming any individual micro-protocol. (Right) The adaptive macro-protocol continuously improves the safety-usefulness tradeoff over time, performing better on longer sequences of tasks. Here $t$ indicates the number of tasks. We report the mean and standard error across 80 trials, each with a different data split. All micro-protocols use GPT-3.5-Turbo as $T$.
  • Figure 4: Analysis of Capability Gap. (Left) We investigate monitoring performance with $T$ of different strengths. We measure AuROC for distinguishing backdoored code from non-backdoored code. (Middle) The capability gap between $T$ and $U$ has a large impact on the safety-usefulness tradeoff. (Right) $U$ struggles to predict the monitor score as evaluated by different $T$, restricting its ability to add more subtle backdoored code. As the capability gap becomes larger, it is increasingly difficult for $U$ to simulate $T$.
  • Figure 5: Monitor performance characteristics for four trusted models. Each model has a distinctly different distribution of monitor scores for innocent and backdoored solutions, leading to both measurable differences in AuROC and qualitative differences in monitor performance.
  • ...and 2 more figures
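
The selection step described in the captions of Figures 1 and 2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's objective: it assumes the expected utility of a candidate is its expected usefulness minus $\alpha$ times the credence-weighted backdoor probability, and every number attached to the candidates is invented for illustration. The names `Candidate`, `expected_utility`, and `select` are hypothetical; only $\alpha=10$, $c=0.5$, and $c=0.85$ come from the Figure 2 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    micro_protocol: str
    expected_usefulness: float   # hypothetical P(solution is correct)
    p_backdoor_if_malign: float  # hypothetical P(backdoor | U is malign)


def expected_utility(candidate, credence, alpha):
    """Assumed form: usefulness minus alpha times credence-weighted harm."""
    expected_harm = credence * candidate.p_backdoor_if_malign
    return candidate.expected_usefulness - alpha * expected_harm


def select(candidates, credence, alpha):
    """Pick the candidate with the highest expected utility."""
    return max(candidates, key=lambda cand: expected_utility(cand, credence, alpha))


# Made-up values chosen only to reproduce the qualitative flip in Figure 2.
CANDIDATES = [
    Candidate("Query Untrusted", expected_usefulness=0.9, p_backdoor_if_malign=0.10),
    Candidate("Edit Untrusted", expected_usefulness=0.6, p_backdoor_if_malign=0.05),
]

early_choice = select(CANDIDATES, credence=0.5, alpha=10.0)
late_choice = select(CANDIDATES, credence=0.85, alpha=10.0)
```

With these invented numbers, the riskier but more useful candidate wins under the initial prior, and the safer candidate wins once the credence rises; the crossover sits at $c=0.6$, which is a property of the toy values rather than of the paper's results.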