Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

TL;DR

Lite Any Stereo tackles the challenge of efficient zero-shot stereo matching with an ultra-light architecture that maintains competitive accuracy. It integrates a compact backbone with a hybrid 3D-2D cost aggregation and a three-stage million-scale training strategy to close the sim-to-real gap via synthetic supervision, self-distillation, and real-world pseudo-label distillation. The approach achieves state-of-the-art zero-shot accuracy among efficient methods, and even matches or surpasses non-prior-based accurate models while using less than 1% of their MACs. This enables practical deployment of reliable stereo depth estimation on resource-constrained devices.

Abstract

Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

Paper Structure

This paper contains 14 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Zero-shot prediction on in-the-wild stereo images. The proposed method achieves accurate disparity estimation across diverse scenarios and maintains high efficiency, even on older-generation GPUs.
  • Figure 2: Zero-shot performance. Our method achieves SOTA among efficient methods by a large margin, and is comparable to or better than non-prior-based accurate models while requiring less than 1% of their MACs.
  • Figure 3: Overview of the proposed Lite Any Stereo. Given an input stereo image pair, features are first extracted using a shared-weight feature extraction module. A correlation module then constructs a cost volume from the extracted features, which is processed by a hybrid 3D-2D cost aggregation module to obtain an aggregated cost volume along both the disparity and spatial dimensions. Finally, a low-resolution disparity map is estimated and a convex upsampling operation is applied to recover the full-resolution disparity map.
  • Figure 4: Overview of the proposed three-stage training strategy. Stage ①: The lite model is trained using a standard supervised setup on a mix of synthetic datasets comprising 1.8M labeled stereo image pairs. Stage ②: Self-distillation is employed, where both teacher and student models are initialized from the Stage ① weights. The teacher receives clean data, while the student is fed perturbed inputs to encourage learning of domain-invariant representations via feature alignment. Stage ③: The lite model is fine-tuned on unlabeled real-world data using pseudo labels generated by a frozen accurate model.
  • Figure 5: Design choices for hybrid cost aggregation module.
  • ...and 2 more figures
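
The correlation step named in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `correlation_cost_volume`, the channel-mean dot product as the similarity measure, the zero padding at the image border, and the argmax readout are all assumptions made for illustration. The paper's hybrid 3D-2D aggregation and convex upsampling are not reproduced here.

```python
import numpy as np


def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation volume from (C, H, W) feature maps.

    Entry [d, y, x] is the channel-wise mean of
    feat_left[:, y, x] * feat_right[:, y, x - d].
    Positions where x - d falls outside the image stay at zero
    (an assumption; other padding schemes are possible).
    """
    _, h, w = feat_left.shape
    volume = np.zeros((max_disp, h, w), dtype=feat_left.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_left * feat_right).mean(axis=0)
        else:
            # A left pixel at column x is compared with the right pixel at x - d.
            volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :-d]).mean(axis=0)
    return volume


def unit_normalize(feat):
    """Scale each pixel's feature vector to unit L2 norm along the channel axis."""
    return feat / np.linalg.norm(feat, axis=0, keepdims=True)
```

As a usage sketch, two unit-normalized feature maps of shape (C, H, W) passed to `correlation_cost_volume(left, right, max_disp=8)` give a volume of shape (8, H, W); taking `volume.argmax(axis=0)` yields a crude integer disparity per pixel. A learned model would instead aggregate the volume before regressing disparity, as the caption describes.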