Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching

Peng Xu, Zhiyu Xiang, Chenyu Qiao, Jingyun Fu, Tianyu Pu

TL;DR

The paper tackles depth ambiguity in stereo matching by replacing uni-modal supervision with an adaptive multi-modal cross-entropy loss that models per-pixel ground-truth as a mixture of Laplacians learned from local neighborhoods. Disparity distributions are constructed via DBSCAN clustering within a local window, yielding multiple potential depths with weights anchored by local structure. A Dominant-Modal Disparity Estimator selects the most likely modal using cumulative probability and normalizes within that modal for final disparity estimation. Across SceneFlow and KITTI, the approach improves edge fidelity and overall accuracy, achieving state-of-the-art results on KITTI benchmarks and demonstrating strong cross-domain generalization and robustness to sparse ground-truth. The method is readily applicable to classic cost-volume networks and promises practical impact for reliable 3D sensing in real-world scenes.
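
As a rough illustration of the target construction described above, the sketch below builds a per-pixel multi-modal distribution from the ground-truth disparities in a local window. It is a minimal sketch under stated assumptions, not the authors' implementation: `cluster_1d` is a simplified gap-based stand-in that behaves like DBSCAN with `min_samples=1` on 1-D data, the weights are taken as plain cluster-size fractions, and the names `multimodal_target`, `eps`, and `scale` are hypothetical.

```python
import math


def cluster_1d(values, eps):
    """Group 1-D values into clusters; a gap larger than eps starts a new one.

    Simplified stand-in for DBSCAN (assumption): on 1-D data this matches
    DBSCAN with min_samples=1, where every point is a core point.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= eps:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def multimodal_target(window_disparities, max_disp, eps=3.0, scale=0.8):
    """Return a mixture-of-Laplacians target over disparities 0..max_disp-1.

    Assumptions: each cluster mean is the Laplacian centre, each weight is
    the cluster's share of the window, and `scale` is one shared width.
    """
    clusters = cluster_1d(window_disparities, eps)
    total = len(window_disparities)
    target = [0.0] * max_disp
    for members in clusters:
        mu = sum(members) / len(members)
        weight = len(members) / total
        # Discretised Laplacian, normalised over the disparity range.
        lap = [math.exp(-abs(d - mu) / scale) for d in range(max_disp)]
        norm = sum(lap)
        for d in range(max_disp):
            target[d] += weight * lap[d] / norm
    return target


# A 3x3 window straddling an edge: six pixels near disparity 10, three near 40.
window = [10.0] * 6 + [40.0] * 3
target = multimodal_target(window, max_disp=64)
```

In this example the target has two peaks, at disparities 10 and 40, with the larger share of probability on the side of the edge that covers more of the window. A window containing a single surface collapses to one cluster and hence to the usual uni-modal target.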

Abstract

Despite the great success of deep learning in stereo matching, recovering accurate disparity maps is still challenging. Currently, L1 and cross-entropy are the two most widely used losses for stereo network training. Compared with the former, the latter usually performs better thanks to its probability modeling and direct supervision to the cost volume. However, how to accurately model the stereo ground-truth for cross-entropy loss remains largely under-explored. Existing works simply assume that the ground-truth distributions are uni-modal, which ignores the fact that most of the edge pixels can be multi-modal. In this paper, a novel adaptive multi-modal cross-entropy loss (ADL) is proposed to guide the networks to learn different distribution patterns for each pixel. Moreover, we optimize the disparity estimator to further alleviate the bleeding or misalignment artifacts in inference. Extensive experimental results show that our method is generic and can help classic stereo networks regain state-of-the-art performance. In particular, GANet with our method ranks $1^{st}$ on both the KITTI 2015 and 2012 benchmarks among the published methods. Meanwhile, excellent synthetic-to-realistic generalization performance can be achieved by simply replacing the traditional loss with ours.
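
The estimator change mentioned in the abstract can be sketched as follows. This is a hedged illustration of the dominant-modal idea only: the rule used here to separate modals (splitting at local minima of the predicted distribution) is an assumption, and the function names `split_modals` and `dominant_modal_disparity` are hypothetical rather than taken from the paper's code.

```python
def split_modals(probs):
    """Split a discrete distribution into modals at its local minima.

    Returns a list of (start, end) inclusive index ranges. The splitting
    rule is an assumption made for this sketch.
    """
    ranges = []
    start = 0
    for i in range(1, len(probs) - 1):
        # A valley: strictly lower than the left neighbour and not higher
        # than the right neighbour.
        if probs[i] < probs[i - 1] and probs[i] <= probs[i + 1]:
            ranges.append((start, i))
            start = i + 1
    ranges.append((start, len(probs) - 1))
    return ranges


def dominant_modal_disparity(probs):
    """Expected disparity inside the modal with the largest cumulative mass.

    Assumes `probs` is non-negative and sums to a positive value.
    """
    modals = split_modals(probs)
    start, end = max(modals, key=lambda r: sum(probs[r[0]:r[1] + 1]))
    mass = sum(probs[start:end + 1])
    # Renormalise within the chosen modal, then take the expectation.
    return sum(d * probs[d] for d in range(start, end + 1)) / mass


# A sharp but light peak at disparity 2 and a broad, heavier modal around 8.
probs = [0.00, 0.02, 0.30, 0.02, 0.00, 0.00,
         0.10, 0.18, 0.20, 0.14, 0.04, 0.00]
full_expectation = sum(d * p for d, p in enumerate(probs))
dominant = dominant_modal_disparity(probs)
```

Here the expectation over the full distribution falls between the two surfaces (about 5.8), which is the kind of bleeding the abstract refers to. An estimator that follows the single highest-probability candidate would settle near disparity 2, while the dominant-modal estimate stays inside the heavier modal (about 7.76).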

Paper Structure (15 sections, 6 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison of the reconstructed point clouds. Our method can alleviate the over-smoothing and misalignment artifacts, which is critical to the performance of downstream tasks.
  • Figure 2: Training trends of the uni-modal cross-entropy loss on SceneFlow dataset.
  • Figure 3: Illustration of our adaptive multi-modal modeling for cross-entropy loss. Given the pixel for modeling, the disparities within a pre-defined window are divided into $K$ clusters $\{\Omega_1,\Omega_2,...,\Omega_K\}$, and the mean $\mu_k$ for each cluster is calculated to form a uni-modal Laplacian distribution. The final adaptive multi-modal distribution is generated by the weighted summation of the Laplacian distributions, with the weight $w_k$ determined by $|\Omega_k|$.
  • Figure 4: Illustration of the modal selection strategy during inference. SME (SMNet) prefers the modal that contains the maximum-probability candidate (indicated by the green arrow). Our proposed DME prefers the modal with the maximum cumulative probability.
  • Figure 5: Visualization of output distributions at the edge. Top row: background pixel, bottom row: foreground pixel.
  • ...and 3 more figures