Keep on Swimming: Real Attackers Only Need Partial Knowledge of a Multi-Model System

Julian Collado, Kevin Stangl

TL;DR

This work introduces a method for crafting an adversarial attack against a full multi-model system when the attacker has a proxy model only for the final black-box model, and when the transformation applied by the earlier models can render adversarial perturbations ineffective.

Abstract

Recent approaches in machine learning often solve a task using a composition of multiple models or agentic architectures. When targeting a composed system with adversarial attacks, it might not be computationally or informationally feasible to train an end-to-end proxy model or a proxy model for every component of the system. We introduce a method to craft an adversarial attack against the overall multi-model system when we only have a proxy model for the final black-box model, and when the transformation applied by the initial models can make the adversarial perturbations ineffective. Current methods handle this either by applying many copies of the first model/transformation to an input and then reusing a standard adversarial attack with averaged gradients, or by learning a proxy model for both stages. To our knowledge, this is the first attack specifically designed for this threat model. Our method has a substantially higher attack success rate (80% vs 25%) and produces perturbations that are 9.4% smaller (MSE) than those of prior state-of-the-art methods. Our experiments focus on a supervised image pipeline, but we are confident the attack will generalize to other multi-model settings (e.g., a mix of open- and closed-source foundation models) or to agentic systems.

Paper Structure

This paper contains 12 sections, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Multi-Model System with Gradient Restrictions: We have limited query access to $h_1$ and full query/gradient access to $h_2$, and we want to craft an end-to-end attack. The core issue is that an adversarial sample crafted against $h_2$ (second row) might not remain adversarial after the transformation applied by $h_1$. For example, when $h_1$ performs segmentation followed by an image crop, the perturbation could slightly shift the crop box output by $h_1$, so that the sample is no longer adversarial to $h_2$ (third row).
  • Figure 2: Keep on Swimming (KoS) Multi-Model Attack: Update the sample fed into the start of the pipeline whenever the adversarial perturbation is made ineffective by $h_1$.
  • Figure 3: Adversarial attack sample using HopSkipJumpAttack; the adversarial modification is too evident to be useful.
  • Figure 4: Visual comparison of the final cropped images for each attack pipeline when converting a value of $79.12$ to $100.00$ and vice versa, indicating whether each attack succeeded. The final adversarial sample is the whole check image, but the cropped versions are shown here to highlight visual differences in the adversarial modifications. In this particular sample, the KoS perturbations are less noticeable, consistent with the lower average MSE reported in Table 1.
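
The loop described in the Figure 2 caption can be illustrated with a minimal toy sketch. This is not the paper's algorithm, which is not reproduced in this section; it only shows the general idea of re-querying the first stage and continuing the attack whenever the perturbation stops being effective. Everything here is a hypothetical stand-in: `h1_crop` plays the role of the black-box $h_1$ (a crop of the brightest window of a 1-D signal), `h2_score` and `h2_grad` play the role of a linear proxy for $h_2$, and the sketch assumes the attacker can map the crop back to its position in the full input.

```python
import numpy as np

CROP = 4  # hypothetical crop width used by the toy first stage


def h1_crop(x):
    """Toy stand-in for the black-box h_1: crop the window with the largest sum."""
    sums = np.convolve(x, np.ones(CROP), mode="valid")  # sliding-window sums
    start = int(np.argmax(sums))
    return x[start:start + CROP].copy(), start


W = np.ones(CROP)  # toy proxy weights for h_2 (assumption, not from the paper)
B = -1.0


def h2_score(z):
    """Toy linear proxy for h_2: a positive score means the original label."""
    return float(W @ z + B)


def h2_grad(z):
    """Gradient of the proxy score w.r.t. its input (constant for a linear model)."""
    return W


def kos_attack_sketch(x, step=0.1, max_iters=100):
    """Keep attacking until the perturbation survives the first stage."""
    x_adv = np.asarray(x, dtype=float).copy()
    for _ in range(max_iters):
        crop, start = h1_crop(x_adv)  # re-query h_1 on the current full input
        if h2_score(crop) < 0:        # label flipped after h_1: attack succeeded
            return x_adv, True
        # Otherwise take a signed-gradient step against the proxy h_2 and
        # write it back into the region of the full input that h_1 selected.
        x_adv[start:start + CROP] -= step * np.sign(h2_grad(crop))
    return x_adv, False
```

In this toy setting the crop window can move as the input is perturbed, which is the failure mode described in the Figure 1 caption; the loop handles it by always attacking the crop that $h_1$ currently produces rather than the crop of the original input.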