Left-right Discrepancy for Adversarial Attack on Stereo Networks
Pengfei Wang, Xiaofei Hui, Beijia Lu, Nimrod Lilith, Jun Liu, Sameer Alam
TL;DR
The paper studies the vulnerability of stereo networks to adversarial perturbations that exploit left-right feature mismatches. It introduces a left-right warping loss $\ell_w$ that, combined with the disparity loss $\ell_d$, drives perturbations to maximize the dissimilarity between intermediate left features and warped right features, which in turn degrades the disparity estimates. It proposes both a white-box attack and a proxy-network black-box attack, and evaluates them on KITTI 2015 and Scene Flow across three mainstream stereo models, inducing MAE up to $219\%$ higher than existing state-of-the-art attacks on KITTI and $85\%$ higher on Scene Flow, along with notable drops in left-right feature similarity. The findings highlight a pronounced sensitivity of stereo networks to discrepancies in shallow-layer features, suggesting that protecting early-stage features is key to improving the robustness of stereo vision systems.
Abstract
Stereo matching neural networks often involve a Siamese structure to extract intermediate features from left and right images. The similarity between these intermediate left-right features significantly impacts the accuracy of disparity estimation. In this paper, we introduce a novel adversarial attack approach that generates perturbation noise specifically designed to maximize the discrepancy between left and right image features. Extensive experiments demonstrate the superior capability of our method to induce larger prediction errors in stereo neural networks, e.g., outperforming existing state-of-the-art attack methods by 219% MAE on the KITTI dataset and 85% MAE on the Scene Flow dataset. Additionally, we extend our approach to include a proxy network black-box attack method, eliminating the need for access to the target stereo neural network. This method leverages an arbitrary network from a different vision task as a proxy to generate adversarial noise, effectively causing the stereo network to produce erroneous predictions. Our findings highlight a notable sensitivity of stereo networks to discrepancies in shallow-layer features, offering valuable insights that could guide future research in enhancing the robustness of stereo vision systems.
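To make the notion of left-right feature discrepancy concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes per-pixel feature vectors along a single scanline, integer disparities, and a discrepancy defined as one minus the mean cosine similarity between left features and right features warped into the left view. The function names (`warp_right_to_left`, `warping_discrepancy`) and this exact form of the measure are hypothetical illustrations; the actual loss operates on network feature maps with differentiable warping.

```python
import math


def warp_right_to_left(right_feat, disparity):
    """Warp right-view features into the left view along one scanline.

    Assumption: left pixel x corresponds to right pixel x - d(x), with
    integer disparities. Out-of-range samples are marked as None.
    """
    width = len(right_feat)
    warped = []
    for x, d in enumerate(disparity):
        src = x - d
        warped.append(right_feat[src] if 0 <= src < width else None)
    return warped


def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def warping_discrepancy(left_feat, right_feat, disparity):
    """Hypothetical discrepancy: 1 - mean cosine similarity over valid pixels.

    An attack in the spirit of the abstract would try to increase this value.
    """
    warped = warp_right_to_left(right_feat, disparity)
    sims = [cosine(l, w) for l, w in zip(left_feat, warped) if w is not None]
    if not sims:
        return 0.0
    return 1.0 - sum(sims) / len(sims)


# Toy scanline with 3 pixels and 2-dimensional features.
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
disp = [1, 1, 1]

# Clean right features: every valid left pixel matches its warped partner.
right_clean = [[0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
# Perturbed right features: the first right pixel no longer matches.
right_perturbed = [[1.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

clean_score = warping_discrepancy(left, right_clean, disp)
perturbed_score = warping_discrepancy(left, right_perturbed, disp)
print(clean_score, perturbed_score)
```

In this toy setting the perturbed features give a larger discrepancy than the clean ones, which is the quantity the attack seeks to increase. A real attack would compute gradients of such a term with respect to the input images rather than editing features directly.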
