Left-right Discrepancy for Adversarial Attack on Stereo Networks

Pengfei Wang, Xiaofei Hui, Beijia Lu, Nimrod Lilith, Jun Liu, Sameer Alam

TL;DR

The paper studies the vulnerability of stereo networks to adversarial perturbations that exploit left-right feature mismatches. It introduces a novel left-right warping loss $\ell_w$ that, combined with the disparity loss $\ell_d$, steers the perturbation toward maximizing the dissimilarity between the left features and the warped right features at an intermediate layer, which in turn degrades the disparity estimates. It proposes both a white-box attack and a proxy-network black-box attack, and evaluates them on KITTI 2015 and Scene Flow across three mainstream stereo models. The attack induces up to $219\%$ higher MAE than existing state-of-the-art attacks on KITTI and $85\%$ higher on Scene Flow, together with notable drops in left-right feature similarity. The findings highlight a pronounced sensitivity of stereo networks to discrepancies in shallow-layer features, suggesting that protecting early-stage features is key to improving the reliability of stereo vision systems.

Abstract

Stereo matching neural networks often involve a Siamese structure to extract intermediate features from left and right images. The similarity between these intermediate left-right features significantly impacts the accuracy of disparity estimation. In this paper, we introduce a novel adversarial attack approach that generates perturbation noise specifically designed to maximize the discrepancy between left and right image features. Extensive experiments demonstrate the superior capability of our method to induce larger prediction errors in stereo neural networks, e.g., outperforming existing state-of-the-art attack methods by 219% MAE on the KITTI dataset and 85% MAE on the Scene Flow dataset. Additionally, we extend our approach to include a proxy network black-box attack method, eliminating the need for access to the stereo neural network. This method leverages an arbitrary network from a different vision task as a proxy to generate adversarial noise, effectively causing the stereo network to produce erroneous predictions. Our findings highlight a notable sensitivity of stereo networks to discrepancies in shallow-layer features, offering valuable insights that could guide future research in enhancing the robustness of stereo vision systems.
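To make the idea of maximizing left-right feature discrepancy concrete, here is a minimal NumPy sketch on a toy problem. It is not the paper's implementation: the shared linear map `W` stands in for a Siamese feature extractor, the squared feature distance stands in for the discrepancy objective, and the names `feature_discrepancy`, `ifgsm_discrepancy_attack`, `eps`, `alpha`, and `steps` are all hypothetical. The update is an iterative signed-gradient step with a bounded perturbation, in the spirit of I-FGSM.

```python
import numpy as np


def feature_discrepancy(W, x_left, x_right):
    """Squared distance between left and right features of a shared (Siamese) linear map W."""
    diff = W @ x_left - W @ x_right
    return float(diff @ diff)


def ifgsm_discrepancy_attack(W, x_left, x_right, eps=0.03, alpha=0.01, steps=5):
    """Iterative signed-gradient ascent on the feature discrepancy (toy, assumed objective).

    Each input stays within an L-infinity ball of radius eps around its clean
    value and within the valid intensity range [0, 1].
    """
    adv_l, adv_r = x_left.copy(), x_right.copy()
    for _ in range(steps):
        diff = W @ adv_l - W @ adv_r
        grad_l = 2.0 * W.T @ diff   # analytic gradient w.r.t. the left input
        grad_r = -grad_l            # gradient w.r.t. the right input
        adv_l = adv_l + alpha * np.sign(grad_l)
        adv_r = adv_r + alpha * np.sign(grad_r)
        # Project back onto the eps-ball around the clean inputs, then onto [0, 1].
        adv_l = np.clip(np.clip(adv_l, x_left - eps, x_left + eps), 0.0, 1.0)
        adv_r = np.clip(np.clip(adv_r, x_right - eps, x_right + eps), 0.0, 1.0)
    return adv_l, adv_r


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                 # hypothetical shared feature extractor
x_left = rng.uniform(0.1, 0.9, size=16)      # toy "left image"
x_right = rng.uniform(0.1, 0.9, size=16)     # toy "right image"

adv_left, adv_right = ifgsm_discrepancy_attack(W, x_left, x_right)
clean_gap = feature_discrepancy(W, x_left, x_right)
attacked_gap = feature_discrepancy(W, adv_left, adv_right)
```

In a real stereo network the gradient would come from automatic differentiation through the feature extractor rather than from a closed-form expression, but the structure of the loop (signed gradient step, then projection onto the perturbation budget) is the same.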

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of the left-right discrepancy adversarial attack. Without an adversarial attack, the stereo network predicts disparity accurately. Compared to the existing attack method, the proposed attack causes the stereo network to produce significantly larger errors.
  • Figure 2: The architecture of the proposed attack method. The stereo network takes the left image $I_L$ and right image $I_R$ as input and produces the estimated disparity $D_{est}$. We extract intermediate left and right features $F_L, F_R$ from an arbitrary layer and warp $F_R$ using the disparity (either the ground-truth disparity $D_{gt}$ or the prediction $D_{pred}$ on clean images) to craft a pseudo feature $F_{R_w}$. A novel warping loss $\ell_w$ maximizes the dissimilarity between $F_L$ and $F_{R_w}$ and can be combined with the commonly used disparity loss $\ell_{d}$. The adversarial noise $\delta_L, \delta_R$ is generated by maximizing the loss with the FGSM (Goodfellow et al., 2014) or I-FGSM (Kurakin et al., 2016) algorithm. The generated noise is added to the inputs $I_L, I_R$, yielding the noisy disparity prediction $D_{noise}$.
  • Figure 3: White-box attack results on AANet with the Scene Flow dataset.
  • Figure 4: Distribution of disparity before and after adversarial attack. "Clean" represents the disparity prediction of the stereo network without adversarial attack; "Vanilla" and "Ours" represent the disparity predictions after an adversarial attack with the vanilla loss $\ell_d$ and our joint loss function $\ell$, respectively.
  • Figure 5: Left-right feature similarity before and after white-box adversarial attack. "Clean" represents the similarity values without adversarial attack. $F_i$ ($i=1,2,3$) denotes the similarity values obtained when the warping loss is applied to the intermediate feature $F_i$.
  • ...and 2 more figures
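
The warping step described in the Figure 2 caption can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' code: the function names `warp_right_to_left` and `warping_loss` are hypothetical, the disparity is assumed to be expressed in feature-map pixels (so it would need rescaling if the feature map is smaller than the image), and the dissimilarity is assumed to be one minus the per-pixel cosine similarity, since the exact form of $\ell_w$ is not given here.

```python
import numpy as np


def warp_right_to_left(feat_right, disparity):
    """Warp right-view features into the left view with linear interpolation along x.

    feat_right: array of shape (C, H, W).
    disparity:  array of shape (H, W), in feature-map pixels (assumption).
    A left pixel (y, x) is sampled from the right view at (y, x - d).
    Returns the warped features (C, H, W) and a validity mask (H, W).
    """
    _, H, W = feat_right.shape
    src_x = np.arange(W)[None, :] - disparity        # source column in the right view
    valid = (src_x >= 0) & (src_x <= W - 1)          # samples that fall inside the image
    src_x = np.clip(src_x, 0, W - 1)
    x0 = np.floor(src_x).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w1 = src_x - x0                                  # interpolation weight of the right neighbor
    rows = np.arange(H)[:, None]
    warped = (1.0 - w1) * feat_right[:, rows, x0] + w1 * feat_right[:, rows, x1]
    return warped, valid


def warping_loss(feat_left, feat_right_warped, valid, eps=1e-8):
    """Mean (1 - cosine similarity) over valid pixels; an attacker would maximize this."""
    dot = (feat_left * feat_right_warped).sum(axis=0)
    norms = np.linalg.norm(feat_left, axis=0) * np.linalg.norm(feat_right_warped, axis=0)
    cosine = dot / (norms + eps)
    return float((1.0 - cosine)[valid].mean())


# Toy check: the right features are the left features shifted by a constant disparity.
rng = np.random.default_rng(0)
d = 3
feat_left = rng.normal(size=(4, 6, 20))
feat_right = np.roll(feat_left, -d, axis=2)          # right[x] = left[x + d]

warped, valid = warp_right_to_left(feat_right, np.full((6, 20), float(d)))
aligned_loss = warping_loss(feat_left, warped, valid)

unwarped, all_valid = warp_right_to_left(feat_right, np.zeros((6, 20)))
misaligned_loss = warping_loss(feat_left, unwarped, all_valid)
```

With the correct disparity, the warped right features line up with the left features and the loss is close to zero; with a wrong disparity the loss is large. The attack described in the caption pushes in the opposite direction: it perturbs the inputs so that the features disagree even when warped with the correct disparity.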