Distilling Stereo Networks for Performant and Efficient Leaner Networks

Rafia Rahim, Samuel Woerz, Andreas Zell

TL;DR

This work tackles the limited exploration of knowledge distillation for stereo matching and introduces a joint distillation pipeline to transfer knowledge from strong stereo teachers to a lean, fast student. The proposed DSNet leverages multiple distillation points across backbone features, cost volume, cost aggregation, and disparity regression, guided by a suite of losses and a weighted objective. On SceneFlow, KITTI, and cross-domain datasets like ETH3D and Middlebury, DSNet achieves competitive accuracy with substantially fewer parameters and faster inference than teachers, demonstrating strong generalization. Overall, the paper provides a practical baseline and design methodology for distilling complex, multi-module stereo networks toward real-time deployment.

Abstract

Knowledge distillation has been quite popular in vision for tasks like classification and segmentation; however, not much work has been done on distilling state-of-the-art stereo matching methods, despite their range of applications. One reason for its lack of use in stereo matching is the inherent complexity of these networks, where a typical network is composed of multiple two- and three-dimensional modules. In this work, we systematically combine the insights from state-of-the-art stereo methods with general knowledge-distillation techniques to develop a joint framework for stereo network distillation with competitive results and faster inference. Moreover, we show, via a detailed empirical analysis, that distilling knowledge from a stereo network requires careful design of the complete distillation pipeline, from the backbone to the right selection of distillation points and corresponding loss functions. This results in student networks that are not only leaner and faster but also give excellent performance. For instance, our student network, while performing better than performance-oriented methods like PSMNet [1], CFNet [2], and LEAStereo [3] on the benchmark SceneFlow dataset, is 8x, 5x, and 8x faster, respectively. Furthermore, compared to speed-oriented methods with inference times under 100 ms, our student networks perform better than all the tested methods. In addition, our student network also shows better generalization capabilities when tested on unseen datasets like ETH3D and Middlebury.

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Comparison of our student network with other state-of-the-art methods on the test set. Compared to the teacher network (ACVNet [xu2022acvnet]), our proposed student network (DSNet (S)) has $3 \times$ fewer parameters and is $8 \times$ faster.
  • Figure 2: Proposed knowledge distillation pipeline for stereo matching. We use different distillation points with different types of loss functions to distill learned information from teacher to student. Here $loss_{fe}$ is the feature extraction loss, $loss_{cv}$ is the cost volume loss, $loss_{ca}$ is the cost aggregation loss, $loss_{stpw}$ is the student pixel-wise loss w.r.t. the teacher, and $loss_{spw}$ is the student pixel-wise loss w.r.t. the ground truth (GT). $Loss$ is the overall learning objective, defined by accumulating each component's loss as explained in the methodology section.
  • Figure 3: Qualitative results on sample SceneFlow test images. In the error maps, darker red indicates higher disparity error and darker blue indicates lower disparity error.
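
The accumulation of component losses described in the Figure 2 caption can be sketched as a weighted sum, $Loss = \sum_k \lambda_k \, loss_k$ over $k \in \{fe, cv, ca, stpw, spw\}$. The sketch below is only illustrative and is not the authors' implementation: the weights $\lambda_k$, the choice of mean squared error for the intermediate distillation points, the choice of smooth-L1 for the pixel-wise terms, and all function and key names are assumptions, since this excerpt does not specify them. Tensors are represented as flat lists of floats to keep the example dependency-free, and student and teacher outputs are assumed to have matching sizes.

```python
def mse(pred, target):
    """Mean squared error between two equal-length flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1(pred, target):
    """Mean smooth-L1 distance: quadratic below 1, linear above."""
    def per_element(p, t):
        d = abs(p - t)
        return 0.5 * d * d if d < 1.0 else d - 0.5
    return sum(per_element(p, t) for p, t in zip(pred, target)) / len(pred)


def total_distillation_loss(student, teacher, gt_disparity, weights):
    """Weighted sum of the five component losses named in Figure 2.

    `student` and `teacher` are dicts holding the outputs at each
    distillation point. The key names and the per-component loss
    functions are hypothetical choices made for this sketch.
    """
    components = {
        # Intermediate distillation points: student mimics the teacher.
        "fe": mse(student["features"], teacher["features"]),
        "cv": mse(student["cost_volume"], teacher["cost_volume"]),
        "ca": mse(student["aggregated"], teacher["aggregated"]),
        # Pixel-wise disparity terms: w.r.t. teacher, then w.r.t. GT.
        "stpw": smooth_l1(student["disparity"], teacher["disparity"]),
        "spw": smooth_l1(student["disparity"], gt_disparity),
    }
    total = sum(weights[name] * value for name, value in components.items())
    return total, components
```

Returning the per-component values alongside the total makes it easy to log each term separately, which helps when tuning the weights or ablating individual distillation points.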