Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk
TL;DR
Lite Any Stereo tackles efficient zero-shot stereo matching with an ultra-light architecture that maintains competitive accuracy. It combines a compact backbone, hybrid 3D-2D cost aggregation, and a three-stage training strategy on million-scale data (synthetic supervision, self-distillation, and distillation from pseudo-labels on real-world data) to close the sim-to-real gap. The approach achieves state-of-the-art zero-shot accuracy among efficient methods, and matches or surpasses non-prior-based accurate models while using less than 1% of their MACs. This makes reliable stereo depth estimation practical on resource-constrained devices.
Abstract
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot generalization due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking first across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding that of state-of-the-art non-prior-based accurate methods while requiring less than 1% of their computational cost, setting a new standard for efficient stereo matching.
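For readers unfamiliar with the cost volume that a cost aggregation module operates on, the following is a minimal, generic sketch in NumPy. It is not the Lite Any Stereo architecture: the function names `correlation_cost_volume` and `winner_take_all` are hypothetical, the features are assumed to be L2-normalised along the channel axis, and the learned aggregation step (3D or 2D) is omitted entirely, with a plain argmax standing in for disparity regression.

```python
import numpy as np


def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Build a correlation volume of shape (max_disp, H, W).

    left_feat, right_feat: arrays of shape (C, H, W), assumed to be
    L2-normalised along the channel axis (an assumption of this sketch).
    """
    _, h, w = left_feat.shape
    volume = np.zeros((max_disp, h, w), dtype=left_feat.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (left_feat * right_feat).sum(axis=0)
        else:
            # Left pixel at column x is compared with right pixel at x - d.
            # Columns x < d have no valid match and keep a score of 0.
            volume[d, :, d:] = (
                left_feat[:, :, d:] * right_feat[:, :, :-d]
            ).sum(axis=0)
    return volume


def winner_take_all(volume):
    """Pick, per pixel, the disparity with the highest correlation."""
    return volume.argmax(axis=0)
```

In a learned pipeline, an aggregation network would refine `volume` before the disparity is read out. Treating the volume as a 3D tensor over (disparity, height, width) or as a 2D feature map with the disparity axis as channels are the two common options that a hybrid design can mix.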
