Red-Teaming Segment Anything Model

Krzysztof Jankowski; Bartlomiej Sobieski; Mateusz Kwiatkowski; Jakub Szulc; Michal Janik; Hubert Baniecki; Przemyslaw Biecek

TL;DR

This work red-teams the Segment Anything Model (SAM) to uncover robustness and safety gaps in segmentation. It covers three axes: style-transfer perturbations, privacy-related celebrity-face prompts, and adversarial attacks on segmentation masks. It also introduces the Focused Iterative Gradient Attack (FIGA), which degrades masks while modifying only a limited number of pixels. The findings show substantial mask degradation under adverse weather conditions and raindrops on the windshield, privacy leakage that varies with the celebrity prompt, and effective white-box attacks alongside weaker black-box attacks; these results motivate defenses such as adversarial training and data filtering. The study underscores the need for explicit safety guarantees and robust defenses when deploying segmentation foundation models in real-world systems such as autonomous driving and privacy-sensitive pipelines.

Abstract

Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts the generated masks. (2) We assess whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and the model's resistance to black-box attacks but also introduce a novel approach, the Focused Iterative Gradient Attack (FIGA), which combines white-box approaches to construct an efficient attack that modifies a smaller number of pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
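A well-known example of the white-box attacks mentioned above is the Fast Gradient Sign Method (FGSM), which shifts every pixel by a fixed step along the sign of the loss gradient. The sketch below is only an illustration under strong assumptions: the per-pixel logistic "segmenter", its weights `w` and `b`, and the helper names are hypothetical stand-ins, not SAM and not the paper's experimental setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_mask(x, w, b):
    """Toy per-pixel segmenter (hypothetical, not SAM): foreground where sigmoid(w*x + b) > 0.5."""
    return sigmoid(w * x + b) > 0.5

def loss_grad_wrt_input(x, w, b, target):
    """Gradient of the per-pixel binary cross-entropy loss with respect to the input pixels."""
    return (sigmoid(w * x + b) - target) * w

def fgsm(x, grad, epsilon):
    """Single FGSM step: shift every pixel by epsilon along the sign of the loss gradient."""
    return np.clip(x + epsilon * np.sign(grad), 0.0, 1.0)

# Toy image whose pixels sit just above the decision boundary (assumed values).
x = np.full((4, 4), 0.55)
w, b = 10.0, -5.0

clean_mask = predict_mask(x, w, b)                # all foreground
grad = loss_grad_wrt_input(x, w, b, clean_mask.astype(float))
x_adv = fgsm(x, grad, epsilon=0.1)                # each pixel moves by at most 0.1
adv_mask = predict_mask(x_adv, w, b)              # all background: the mask is destroyed
```

The attack increases the loss against the model's own clean prediction, so a small, bounded change per pixel is enough to flip the toy mask. A real attack on a segmentation model would obtain the gradient by automatic differentiation rather than from a closed-form expression.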

Paper Structure (18 sections, 2 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Related work
Methods
Robustness to style transfer
Robustness to attacks on privacy
Robustness to adversarial attacks
White-box attacks
Black-box attacks
Results
Robustness to style transfer
Robustness to attacks on privacy
Robustness to adversarial attacks
White-box attacks
Black-box attacks
Robustness of attacks
...and 3 more sections

Figures (8)

Figure 1: Overview of the three red-teaming tasks on which the Segment Anything Model is tested.
Figure 2: Example SAM predictions for the original image and for versions with different weather conditions added. Corresponding masks have the same colors, but because of the masks' opacity and the changing colors of the images, the masks' apparent colors change as well. Red crosses mark the points of interest for which the corresponding masks were generated.
Figure 3: Distributions of mean IoUs between original image masks and augmented image masks for all weather conditions. The red vertical line represents the mean of the distribution. Exact values of the mean and standard deviation are presented above the histograms.
Figure 4: Example results of segmenting celebrity faces. Images are in a $3 \times 3$ grid with a small green cross in the bottom-right corner of the ground-truth image for the text prompt above the grid. Colored images with a red bounding box are the model's answer to the text prompt. In the first and last examples, the model correctly segmented the person but introduced false positives. In the middle example, the model did not correctly classify the person.
Figure 5: Examples of attacks with perturbations small enough to be imperceptible to the human eye, created using an FGSM-based approach. The attack successfully destroys the original masks.
...and 3 more figures
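
The mean IoU reported in Figure 3 compares each mask predicted on the original image with the mask predicted for the same point of interest on the augmented image. A minimal sketch of that metric follows, assuming binary masks stored as NumPy arrays; the function names and the convention for two empty masks are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)

def mean_iou(original_masks, augmented_masks):
    """Average IoU over paired masks (one pair per point of interest)."""
    return float(np.mean([mask_iou(o, g)
                          for o, g in zip(original_masks, augmented_masks)]))

# Two toy 2x2 masks: one pixel in the intersection, two in the union.
original = np.array([[1, 1], [0, 0]])
augmented = np.array([[1, 0], [0, 0]])
score = mask_iou(original, augmented)
```

A value near 1 means the augmentation left the mask essentially unchanged, while a value near 0 means the mask was heavily distorted.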
