Adversaries Can Misuse Combinations of Safe Models

Erik Jones; Anca Dragan; Jacob Steinhardt

TL;DR

The paper shows that evaluating misuse risk one model at a time is insufficient, because adversaries can combine individually safe models through task decomposition. It formalizes a threat model and studies manual and automated decomposition, using frontier and weak models, across four tasks: vulnerable code, explicit images, Python hacking scripts, and manipulative tweets. Combinations of models yield substantially higher misuse rates than either model alone (e.g., up to 43% vs. <3% for vulnerable code), and misuse rates increase as model quality improves, underscoring the need for ecosystem-level red-teaming. The work argues for safety assessments that continue throughout deployment and cover the whole model ecosystem, and it highlights the dual-use nature of modern AI systems as a practical challenge for risk management.

Abstract

Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.
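For red-teaming purposes, the automated decomposition described above can be sketched as a three-step pipeline. This is a minimal illustration and not the authors' implementation: the `Model` interface, the prompt wording, the `success_rate` helper, and the toy stand-in models are all assumptions introduced here so that the sketch runs without any real model.

```python
from typing import Callable, List

# Hypothetical interface: a model is any function from a prompt to a completion.
Model = Callable[[str], str]


def automated_decomposition(task: str, weak: Model, frontier: Model,
                            n_subtasks: int = 3) -> str:
    """Combine two models as in the abstract (prompt wording is an assumption)."""
    # Step 1: the weak model proposes related tasks, one per line.
    proposal = weak(f"List {n_subtasks} related tasks for: {task}")
    subtasks: List[str] = [ln.strip() for ln in proposal.splitlines() if ln.strip()]
    subtasks = subtasks[:n_subtasks]

    # Step 2: the frontier model solves each related task independently.
    solutions = [frontier(sub) for sub in subtasks]

    # Step 3: the weak model answers the original task with the solutions in-context.
    context = "\n\n".join(
        f"Task: {sub}\nSolution: {sol}" for sub, sol in zip(subtasks, solutions)
    )
    return weak(f"{context}\n\nTask: {task}\nSolution:")


def success_rate(pipeline: Callable[[str], str], tasks: List[str],
                 judge: Callable[[str, str], bool]) -> float:
    """Fraction of tasks whose output the (hypothetical) judge accepts."""
    if not tasks:
        return 0.0
    return sum(judge(t, pipeline(t)) for t in tasks) / len(tasks)


# Toy stand-ins (not real models) so the sketch is runnable.
def toy_weak(prompt: str) -> str:
    if prompt.startswith("List "):
        return "subtask A\nsubtask B\nsubtask C"
    # Report how many in-context solutions were available.
    return f"answer built from {prompt.count('Solution: ')} solutions"


def toy_frontier(prompt: str) -> str:
    return f"solved({prompt})"
```

An evaluator could pass the same task list and judge to `success_rate` three times (weak model alone, frontier model alone, and the combined pipeline) to compare single-model and combined rates, which is the kind of comparison the abstract reports.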

Paper Structure (17 sections, 3 figures, 6 tables)

This paper contains 17 sections, 3 figures, 6 tables.

Introduction
Related Work
Threat model
Manual Decomposition
Vulnerable code generation
Explicit image generation
Automated Decomposition
Malicious python scripts for hacking
Simulated personalized manipulation
Discussion
Additional experimental details and results
Additional compute and hyperparameter details
Additional details and results for vulnerable code experiments (Section "Vulnerable code generation")
Additional details and results for explicit image generation experiments (Section "Explicit image generation")
Additional details and results for the malicious python script experiments (Section "Malicious python scripts for hacking")
...and 2 more sections

Figures (3)

Figure 1: Real example where combining LLMs enables misuse. The adversary aims to create a python script that executes a reverse shell in a Node.js application. A weak model (top left) fails to produce correct code, while the frontier model (top right) refuses to respond. The adversary instead uses the weak model to generate related benign tasks, solves them with the frontier model, and finally uses the weak model to solve the original task using the related solutions in-context (bottom).
Figure 2: Example of manual decomposition. The adversary aims to generate an image of a nude woman taking a reflective beach walk at sunset, but cannot with individual models (DALL-E 3 refuses, while Stable Diffusion v1.5 generates unrealistic images). The adversary instead generates an image of a woman walking on the beach with DALL-E 3, then edits it to contain nudity with the InstructPix2Pix version of Stable Diffusion v1.5. We manually blur the final image to censor explicit content.
Figure 3: Explicit image examples. On the left, Stable Diffusion generates images, then edits them for nudity (single-model-decomp). In the middle, DALL-E 3 generates high-quality images. On the right, the InstructPix2Pix version of Stable Diffusion v1.5 edits the images that DALL-E 3 generates to produce an explicit image.
