LocalStyleFool: Regional Video Style Transfer Attack Using Segment Anything Model
Yuxin Cao, Jinghao Li, Xi Xiao, Derui Wang, Minhui Xue, Hao Ge, Wei Liu, Guangwu Hu
TL;DR
The paper addresses the vulnerability of video recognition systems to unrestricted adversarial attacks by introducing LocalStyleFool, a black-box attack based on regional style transfer. It uses the Segment Anything Model (SAM) to segment frames semantically, selects regions with a Grad-CAM-informed associative criterion, and tracks those regions with the Track Anything Model (TAM) to maintain temporal consistency. Style transfer is then applied only to the selected regions, guided by carefully designed losses. Key contributions include the first use of SAM for malicious video attacks, improved intra-frame and inter-frame naturalness over prior methods, and demonstrated effectiveness on high-resolution datasets with competitive query efficiency. The work highlights the potential security risks of powerful segmentation models and motivates future defenses against region-based style-transfer adversaries in video systems.
Abstract
Previous work has shown that well-crafted adversarial perturbations can threaten the security of video recognition systems. Attackers can compromise such models with a low query budget when the perturbations are semantic-invariant, as in StyleFool. Despite this query efficiency, the naturalness of fine-detail areas still needs improvement, since StyleFool applies style transfer to all pixels in each frame. To close the gap, we propose LocalStyleFool, an improved black-box video adversarial attack that superimposes regional style-transfer-based perturbations on videos. Benefiting from the popularity and scalable usability of the Segment Anything Model (SAM), we first extract different regions according to semantic information and then track them through the video stream to maintain temporal consistency. Then, we add style-transfer-based perturbations to several regions selected by an associative criterion that combines transfer-based gradient information and regional area. The perturbation is then fine-tuned to make the stylized videos adversarial. We demonstrate through a human-assessed survey that LocalStyleFool improves both intra-frame and inter-frame naturalness while maintaining a competitive fooling rate and query efficiency. Successful experiments on a high-resolution dataset also show that the fine-grained segmentation of SAM helps improve the scalability of adversarial attacks to high-resolution data.
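The region selection and regional superimposition steps described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the associative criterion, so the weighted sum below is an assumption made for illustration, as are the hypothetical names `score_regions`, `select_regions`, `composite`, and the weight `alpha`. The sketch assumes boolean region masks (such as a segmenter might produce), a saliency map in [0, 1] standing in for the transfer-based gradient information, and an already stylized copy of the frame.

```python
import numpy as np

def score_regions(masks, saliency, alpha=0.5):
    """Score each region by mean saliency and relative area.

    Assumed criterion (not taken from the paper):
        score = alpha * mean_saliency + (1 - alpha) * area_fraction
    """
    total = saliency.size
    scores = []
    for m in masks:
        area = int(m.sum())
        if area == 0:
            # An empty mask cannot carry a perturbation.
            scores.append(0.0)
            continue
        mean_sal = float(saliency[m].mean())
        scores.append(alpha * mean_sal + (1.0 - alpha) * area / total)
    return np.array(scores)

def select_regions(masks, saliency, k=2, alpha=0.5):
    """Return the indices of the k highest-scoring regions."""
    scores = score_regions(masks, saliency, alpha)
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order[:k]]

def composite(frame, stylized, masks, chosen):
    """Copy stylized pixels into the frame only inside the chosen regions."""
    out = frame.copy()
    for i in chosen:
        out[masks[i]] = stylized[masks[i]]
    return out
```

In a video setting, the same chosen region indices would be reused across frames with per-frame tracked masks, so that the stylized regions stay consistent over time; the tracking itself is outside the scope of this sketch.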
