Reusable Architecture Growth for Continual Stereo Matching
Chenghao Zhang, Gaofeng Meng, Bin Fan, Kun Tian, Zhaoxiang Zhang, Shiming Xiang, Chunhong Pan
TL;DR
This work tackles continual stereo depth estimation under ever-changing scenes, where fixed architectures struggle with forgetting and domain shifts. It introduces Reusable Architecture Growth (RAG), combining task-specific neural unit search, architecture growth with reuse of prior units, and a Scene Router for adaptive inference paths, plus a proxy-supervised variant to operate in label-scarce real-world settings. The method is evaluated across diverse driving datasets in both supervised and self-supervised settings, demonstrating strong final performance with minimal forgetting and substantial reusability, along with ablations validating the importance of cell-level search, growth policy, and proxy supervision. Extensions to monocular depth estimation and stereo-based 3D object detection indicate practical applicability for continual 3D perception in real-world deployment.
Abstract
The remarkable performance of recent stereo depth estimation models benefits from the successful use of convolutional neural networks to regress dense disparity. As with most tasks, this requires gathering training data that covers the heterogeneous scenes encountered at deployment time. However, training samples are typically acquired continuously in practical applications, making the capability to learn new scenes continually even more crucial. For this purpose, we propose to perform continual stereo matching, where a model is tasked to 1) continually learn new scenes, 2) overcome forgetting of previously learned scenes, and 3) continuously predict disparities at inference. We achieve this goal by introducing a Reusable Architecture Growth (RAG) framework. RAG leverages task-specific neural unit search and architecture growth to learn new scenes continually in both supervised and self-supervised manners. It maintains high reusability during growth by reusing previous units while still obtaining good performance. Additionally, we present a Scene Router module that adaptively selects the scene-specific architecture path at inference. Comprehensive experiments on numerous datasets show that our framework performs well across varied weather, road, and city conditions and surpasses state-of-the-art methods in more challenging cross-dataset settings. Further experiments also demonstrate the adaptability of our method to unseen scenes, which can facilitate end-to-end stereo architecture learning and practical deployment.
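
To make the interplay of growth, reuse, and routing concrete, the following is a minimal toy sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the class `GrowingModel`, its methods, the per-layer "new"/"reuse" choices (which stand in for the outcome of the neural unit search), and the nearest-prototype rule (which stands in for the Scene Router, whose actual mechanism is not described in the abstract). Units are represented only by the name of the scene that created them; no stereo network is trained.

```python
class GrowingModel:
    """Toy bookkeeping for a network that grows one scene at a time.

    All names here are hypothetical; a "unit" is just a label recording
    which scene created it, not a real network block.
    """

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.units = [[] for _ in range(num_layers)]  # units[layer][idx] -> owner scene
        self.paths = {}        # scene -> tuple with one unit index per layer
        self.prototypes = {}   # scene -> feature vector used by the toy router

    def learn_scene(self, scene, choices, prototype):
        """Grow the model for a new scene.

        choices[layer] is ("new", None) to add a unit at that layer, or
        ("reuse", idx) to share an existing one. In this sketch the choices
        are given by the caller (assumption: a search would produce them).
        """
        if len(choices) != self.num_layers:
            raise ValueError("need exactly one choice per layer")
        path = []
        for layer, (action, idx) in enumerate(choices):
            if action == "new":
                self.units[layer].append(scene)
                idx = len(self.units[layer]) - 1
            elif action != "reuse" or not 0 <= idx < len(self.units[layer]):
                raise ValueError(f"invalid choice at layer {layer}")
            path.append(idx)
        self.paths[scene] = tuple(path)
        self.prototypes[scene] = tuple(prototype)

    def route(self, feature):
        """Pick the learned scene whose prototype is closest to `feature`.

        Nearest-prototype matching is a stand-in for the Scene Router.
        """
        def sq_dist(scene):
            return sum((a - b) ** 2 for a, b in zip(self.prototypes[scene], feature))

        scene = min(self.prototypes, key=sq_dist)
        return scene, self.paths[scene]

    def reuse_ratio(self, scene):
        """Fraction of layers on the scene's path that use another scene's unit."""
        path = self.paths[scene]
        reused = sum(self.units[layer][idx] != scene for layer, idx in enumerate(path))
        return reused / self.num_layers


model = GrowingModel(num_layers=3)
model.learn_scene("sunny", [("new", None)] * 3, prototype=(0.0, 0.0))
model.learn_scene("rainy", [("reuse", 0), ("new", None), ("reuse", 0)], prototype=(1.0, 1.0))
scene, path = model.route((0.9, 0.8))  # selects the stored path of the closest scene
```

Because earlier units are never modified when a later scene is added, the path stored for an earlier scene keeps pointing at the same units. That is the intuition behind avoiding forgetting in this toy setting, while the reuse ratio indicates how much of the model a new scene shares with earlier ones.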
