Reusable Architecture Growth for Continual Stereo Matching

Chenghao Zhang, Gaofeng Meng, Bin Fan, Kun Tian, Zhaoxiang Zhang, Shiming Xiang, Chunhong Pan

TL;DR

This work tackles continual stereo depth estimation under ever-changing scenes, where fixed architectures struggle with forgetting and domain shifts. It introduces Reusable Architecture Growth (RAG), combining task-specific neural unit search, architecture growth with reuse of prior units, and a Scene Router for adaptive inference paths, plus a proxy-supervised variant to operate in label-scarce real-world settings. The method is evaluated across diverse driving datasets in both supervised and self-supervised settings, demonstrating strong final performance with minimal forgetting and substantial reusability, along with ablations validating the importance of cell-level search, growth policy, and proxy supervision. Extensions to monocular depth estimation and stereo-based 3D object detection indicate practical applicability for continual 3D perception in real-world deployment.

Abstract

The remarkable performance of recent stereo depth estimation models stems from the successful use of convolutional neural networks to regress dense disparity. As with most tasks, this requires gathering training data that covers the heterogeneous scenes encountered at deployment time. However, training samples are typically acquired continuously in practical applications, which makes the ability to learn new scenes continually even more crucial. For this purpose, we propose continual stereo matching, in which a model is tasked to 1) continually learn new scenes, 2) avoid forgetting previously learned scenes, and 3) continuously predict disparities at inference. We achieve this goal by introducing a Reusable Architecture Growth (RAG) framework. RAG leverages task-specific neural unit search and architecture growth to learn new scenes continually in both supervised and self-supervised manners. It maintains high reusability during growth by reusing previous units while still obtaining good performance. Additionally, we present a Scene Router module that adaptively selects the scene-specific architecture path at inference. Comprehensive experiments on numerous datasets show that our framework performs well across various weather, road, and city conditions and surpasses state-of-the-art methods in more challenging cross-dataset settings. Further experiments also demonstrate the adaptability of our method to unseen scenes, which can facilitate end-to-end stereo architecture learning and practical deployment.
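To make the growth-and-routing idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The class `GrowingNetwork`, the function `route`, the hand-supplied reuse choices, and the nearest-prototype routing rule are all assumptions made for illustration. In the paper, the reuse decisions come from a neural unit search and the path is chosen by a learned Scene Router, neither of which is specified in this summary.

```python
class GrowingNetwork:
    """Toy bookkeeping for per-scene architecture paths (hypothetical names)."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        # units[l] lists every unit created so far at layer l.
        self.units = [[] for _ in range(num_layers)]
        # paths[scene] stores one unit index per layer.
        self.paths = {}

    def grow(self, scene, choices):
        """Add a path for `scene`.

        choices[l] is the index of an existing unit to reuse at layer l,
        or None to create a new scene-specific unit there. In this sketch
        the choices are given by hand (an assumption); a search procedure
        would normally produce them.
        Returns the fraction of layers that reuse an earlier unit.
        """
        if len(choices) != self.num_layers:
            raise ValueError("need exactly one choice per layer")
        path = []
        for layer, choice in enumerate(choices):
            if choice is None:
                # Grow: append a new unit and point the path at it.
                self.units[layer].append(f"{scene}/layer{layer}")
                path.append(len(self.units[layer]) - 1)
            else:
                if not 0 <= choice < len(self.units[layer]):
                    raise IndexError(f"no unit {choice} at layer {layer}")
                # Reuse: point the path at an existing unit.
                path.append(choice)
        self.paths[scene] = path
        reused = sum(choice is not None for choice in choices)
        return reused / self.num_layers


def route(scene_feature, prototypes):
    """Pick the scene whose prototype is closest to `scene_feature`.

    Nearest-prototype matching is a stand-in assumption for the
    learned Scene Router described in the abstract.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda s: squared_distance(scene_feature, prototypes[s]))


net = GrowingNetwork(num_layers=3)
first_ratio = net.grow("cloudy", [None, None, None])  # first scene: all units are new
second_ratio = net.grow("foggy", [0, None, 0])        # reuse layers 0 and 2, grow layer 1
chosen = route([0.9, 0.1], {"cloudy": [0.0, 0.0], "foggy": [1.0, 0.0]})
chosen_path = net.paths[chosen]
```

Because earlier units are never overwritten in this sketch, the path stored for "cloudy" is unchanged after "foggy" is added, which is the property that lets a growth-based scheme avoid forgetting earlier scenes.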

Paper Structure (39 sections, 10 equations, 15 figures, 12 tables)

Figures (15)

  • Figure 1: Schematic diagram of our framework deployed on real-world continuous driving scenes. The scene-specific architecture path chosen by the Scene Router is loaded for inference according to the scene type of the input image.
  • Figure 2: Catastrophic forgetting in stereo matching. The deep stereo model is first trained on the cloudy scene and then finetuned on foggy, rainy and sunny scenes in sequence. The red boxes refer to the performance on each scene learned so far, while the blue boxes refer to the generalization performance on unseen scenes. Light colors represent low errors.
  • Figure 3: Overview of our Reusable Architecture Growth framework. For the current task $\mathcal{T}^t$, based on the previous model (a), we first search task-specific neural units of the Feature Net (marked as $F$) and Matching Net (marked as $M$) (b), then select suitable units to make the network grow (c), and finally train the selected specific model (d). At test time, the scene-specific architecture path (marked in red) is selected for inference according to the Scene Router (e). Best viewed in color.
  • Figure 4: The network architecture of the base model.
  • Figure 5: Cross-scene comparison results on each scene of DrivingStereo, KITTI raw, and Virtual KITTI datasets for supervised continual stereo.
  • ...and 10 more figures