Red-Teaming Segment Anything Model
Krzysztof Jankowski, Bartlomiej Sobieski, Mateusz Kwiatkowski, Jakub Szulc, Michal Janik, Hubert Baniecki, Przemyslaw Biecek
TL;DR
This work applies a comprehensive red-teaming framework to the Segment Anything Model (SAM) to uncover robustness and safety gaps in segmentation. It spans three axes: style-transfer perturbations, privacy-related celebrity-face prompts, and adversarial attacks on segmentation masks. It also introduces the Focused Iterative Gradient Attack (FIGA), which degrades masks efficiently while modifying only a limited number of pixels. The findings show substantial mask degradation when adverse weather and windshield raindrops are applied to driving scenes, privacy leakage that varies with the celebrity prompted, and potent white-box attacks in contrast to weaker black-box attacks. These results motivate defense strategies such as adversarial training and data filtering. The study underscores the need for explicit safety guarantees and robust defenses when deploying segmentation foundation models in real-world systems such as autonomous driving and privacy-sensitive pipelines.
Abstract
Foundation models have emerged as pivotal tools, tackling many complex tasks through pre-training on vast datasets and subsequent fine-tuning for specific applications. The Segment Anything Model is one of the first and most well-known foundation models for computer vision segmentation tasks. This work presents a multi-faceted red-teaming analysis that tests the Segment Anything Model against challenging tasks: (1) We analyze the impact of style transfer on segmentation masks, demonstrating that applying adverse weather conditions and raindrops to dashboard images of city roads significantly distorts the generated masks. (2) We assess whether the model can be used for attacks on privacy, such as recognizing celebrities' faces, and show that the model possesses some undesired knowledge in this task. (3) Finally, we check how robust the model is to adversarial attacks on segmentation masks under text prompts. We not only show the effectiveness of popular white-box attacks and the model's resistance to black-box attacks, but also introduce a novel approach, the Focused Iterative Gradient Attack (FIGA), which combines white-box approaches to construct an efficient attack that modifies a smaller number of pixels. All of our testing methods and analyses indicate a need for enhanced safety measures in foundation models for image segmentation.
