Distill-then-prune: An Efficient Compression Framework for Real-time Stereo Matching Network on Edge Devices

Baiyu Pan, Jichao Jiao, Jianxing Pang, Jun Cheng

TL;DR

This work tackles the real-time stereo matching problem on edge devices by addressing the speed–accuracy trade-off with a Distill-Then-Prune framework. It presents a lightweight, implementation-friendly network that replaces 3D convolutions and iterative cost-volume construction with a channel-to-disparity approach, and augments it with knowledge distillation from a strong teacher and structured pruning (DepGraph) to achieve a compact, accurate model. Through extensive ablations on SceneFlow and KITTI, the authors demonstrate that teacher-only, L1-based distillation yields superior supervision, and that Setting3 of their module design provides the best efficiency–accuracy balance. The resulting DTPnet attains competitive or state-of-the-art performance among lightweight stereo methods while delivering real-time latency on edge platforms, with qualitative results showing robust disparity in challenging scenes. This framework is versatile and can be applied to existing stereo architectures, enabling practical deployment in robotics and autonomous systems.
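
The teacher-only, L1-based distillation mentioned above can be illustrated with a minimal sketch. It assumes that "teacher-only" means the student is supervised by the teacher's predicted disparity alone, without the ground truth; the function name `l1_distill_loss` and the nested-list disparity maps are hypothetical and are not taken from the paper's code.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the authors' implementation.

def l1_distill_loss(student_disp, teacher_disp):
    """Mean absolute difference between student and teacher disparity maps.

    Both inputs are 2D lists (rows of per-pixel disparities) of equal shape.
    No ground-truth term appears: supervision comes from the teacher only.
    """
    if len(student_disp) != len(teacher_disp):
        raise ValueError("disparity maps must have the same number of rows")
    total = 0.0
    count = 0
    for s_row, t_row in zip(student_disp, teacher_disp):
        if len(s_row) != len(t_row):
            raise ValueError("disparity maps must have the same row length")
        for s, t in zip(s_row, t_row):
            total += abs(s - t)
            count += 1
    if count == 0:
        raise ValueError("disparity maps must be non-empty")
    return total / count


# Toy 2x2 disparity maps (values in pixels).
teacher = [[10.0, 12.0], [8.0, 6.0]]
student = [[11.0, 12.0], [7.0, 8.0]]
loss = l1_distill_loss(student, teacher)
print(loss)
```

In a real training loop this quantity would be computed on tensors and minimized with respect to the student's weights while the teacher stays frozen.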

Abstract

In recent years, numerous real-time stereo matching methods have been introduced, but they often lack accuracy. These methods attempt to improve accuracy by introducing new modules or integrating traditional methods, but the improvements are only modest. In this paper, we propose a novel strategy that incorporates knowledge distillation and model pruning to overcome the inherent trade-off between speed and accuracy. As a result, we obtain a model that maintains real-time performance while delivering high accuracy on edge devices. Our proposed method involves three key steps. First, we review state-of-the-art methods and design our lightweight model by removing redundant modules from those efficient models through a comparison of their contributions. Next, we leverage the efficient model as the teacher to distill knowledge into the lightweight model. Finally, we systematically prune the lightweight model to obtain the final model. Through extensive experiments conducted on two widely used benchmarks, SceneFlow and KITTI, we perform ablation studies to analyze the effectiveness of each module and present our state-of-the-art results.
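
The final pruning step can be pictured with a small sketch of structured (channel-level) pruning. This is not the DepGraph algorithm the paper uses; it is a simple magnitude-based stand-in that only illustrates the underlying dependency idea: when output channels of one layer are removed, the matching input channels of the next layer must be removed too. The function names and the `[out][in]` weight layout are assumptions made for illustration.

```python
# Illustrative sketch only: a magnitude-based stand-in, not DepGraph itself.

def channel_l1_norms(weight):
    """L1 norm of each output channel; weight is laid out as [out][in]."""
    return [sum(abs(w) for w in row) for row in weight]


def prune_coupled_layers(w1, w2, keep):
    """Keep the `keep` strongest output channels of w1 and the matching
    input channels of w2, so the two layers stay shape-compatible."""
    if not 0 < keep <= len(w1):
        raise ValueError("keep must be between 1 and the number of channels")
    if any(len(row) != len(w1) for row in w2):
        raise ValueError("w2 input width must equal w1 output width")
    norms = channel_l1_norms(w1)
    # Rank channels from largest to smallest L1 norm, then restore order.
    ranked = sorted(range(len(w1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])
    pruned_w1 = [w1[i] for i in kept]
    pruned_w2 = [[row[i] for i in kept] for row in w2]
    return pruned_w1, pruned_w2, kept


# Layer 1: 3 output channels, 2 inputs. Layer 2: 2 outputs, 3 inputs.
w1 = [[0.5, -0.5], [0.01, 0.02], [1.0, 2.0]]
w2 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
pruned_w1, pruned_w2, kept = prune_coupled_layers(w1, w2, keep=2)
print(kept, pruned_w1, pruned_w2)
```

The weakest channel of the first layer (index 1) is dropped together with the second layer's column that consumed it, which is the kind of coupled removal a dependency-aware pruner automates across a whole network.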

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, 7 tables, 1 algorithm.

Figures (3)

  • Figure 1: Latency vs. D1 error on the KITTI 2015 validation set. Latency is measured in milliseconds per frame. Lower is better for both metrics. As shown, our DTPnet achieves a good balance between accuracy and speed.
  • Figure 2: Disparity regression module. The stacked hourglass module is composed of convolutions and transposed convolutions. The arrow line above denotes the skip connection.
  • Figure 3: Qualitative comparison (disparity results and error maps) of AAFS (Chang et al., ACCV 2020), StereoVAE, and our DTPnet. Warmer colors in error maps indicate larger errors.
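
For readers unfamiliar with the D1 error plotted in Figure 1, the sketch below shows how it is commonly computed on KITTI 2015: the percentage of valid pixels whose disparity error exceeds both 3 pixels and 5% of the ground-truth disparity. The thresholds follow the usual benchmark convention and are assumed here rather than quoted from the paper; the function name `d1_error` and the list-based maps are hypothetical.

```python
# Illustrative sketch only: thresholds follow the common KITTI 2015
# convention and are an assumption, not taken from the paper.

def d1_error(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of valid pixels counted as disparity outliers.

    A pixel is an outlier when its absolute error is larger than
    `abs_thresh` pixels AND larger than `rel_thresh` times the true
    disparity. Pixels with gt <= 0 are treated as having no ground truth.
    """
    bad = 0
    valid = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if g <= 0:
                continue  # skip pixels without ground truth
            valid += 1
            err = abs(p - g)
            if err > abs_thresh and err > rel_thresh * g:
                bad += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * bad / valid


gt = [[10.0, 100.0], [0.0, 20.0]]
pred = [[14.0, 104.0], [5.0, 21.0]]
d1 = d1_error(pred, gt)
print(round(d1, 2))
```

In this toy example only the first pixel is an outlier: the second has a 4-pixel error but stays under 5% of its true disparity, and the third has no ground truth.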