Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching

Jintu Zheng, Qizhe Liu, HuangXin Xu, Zhuojie Chen

TL;DR

This work introduces a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference, and proposes a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder.

Abstract

While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNNs) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28$\times$ speedup, a 76.6\% reduction in peak memory, and an 80.9\% reduction in global memory requests over native ConvGRUs at 2K resolution. PipStereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320$\times$640 frames in just 75 ms on an NVIDIA Jetson Orin NX (FP16) and 19 ms on an RTX 4090, matching the accuracy of large iterative models, and its generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: https://github.com/XPENG-Aridge-AI.
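The claim that disparity updates are spatially sparse and temporally redundant can be made concrete with a small measurement. The following is a minimal sketch in plain Python, not the authors' analysis code: the change threshold, the helper names (`update_mask`, `update_ratio`, `overlap_with_previous`), and the overlap definition are all assumptions introduced here for illustration.

```python
# Illustrative sketch only. The threshold, the helper names, and the
# overlap definition are assumptions, not definitions from the paper.

def update_mask(prev, curr, threshold=0.1):
    """Mark pixels whose disparity changed by more than `threshold`."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def update_ratio(mask):
    """Fraction of pixels updated in one iteration (spatial sparsity)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def overlap_with_previous(prev_mask, curr_mask):
    """Fraction of currently updated pixels that were also updated
    in the previous iteration (a proxy for temporal redundancy)."""
    updated = 0
    repeated = 0
    for prev_row, curr_row in zip(prev_mask, curr_mask):
        for was_updated, is_updated in zip(prev_row, curr_row):
            if is_updated:
                updated += 1
                repeated += int(was_updated)
    return repeated / updated if updated else 0.0


# Toy 2x4 disparity maps after three successive refinement iterations.
d0 = [[10.0, 10.0, 10.0, 10.0], [20.0, 20.0, 20.0, 20.0]]
d1 = [[10.0, 10.5, 10.0, 10.0], [20.0, 20.0, 21.0, 20.0]]
d2 = [[10.0, 10.8, 10.0, 10.0], [20.0, 20.0, 21.0, 20.0]]

m1 = update_mask(d0, d1)  # pixels changed by the first update step
m2 = update_mask(d1, d2)  # pixels changed by the second update step

print(update_ratio(m1), update_ratio(m2), overlap_with_previous(m1, m2))
```

On this toy input, a quarter of the pixels change in the first step and an eighth in the second, and the only pixel touched by the second step was already touched by the first. A pattern like this, where few pixels change and the same ones change repeatedly, is the kind of redundancy that motivates pruning update steps.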

Paper Structure (15 sections, 7 equations, 4 figures, 4 tables)

This paper contains 15 sections, 7 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Iteration update positions and hit ratios on the Middlebury test set. Red pixels denote updates; left color bars show consistency with the previous iteration. Refinements are sparse, highly redundant, and affect only a small fraction of the image.
  • Figure 2: Overview of two-stage training: (1) depth prior transfer via multi-level and cost volume feature alignment, and (2) pruned finetuning with progressively fewer iterations, training only the ConvGRU modules.
  • Figure 3: Illustration of FlashGRU with 2 resolution levels and T loops, at 70% sparsity.
  • Figure 4: Visual comparison. Leveraging the depth prior, PipStereo successfully handles ill-posed regions and produces sharper details and more coherent structures than IGEV (Xu et al., 2023).