PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds

Haotian Liu, Sanqing Qu, Fan Lu, Zongtao Bu, Florian Roehrbein, Alois Knoll, Guang Chen

TL;DR

PCDepth tackles monocular depth estimation from images and event data by shifting from pixel-level fusion to pattern-level complementary learning. It discretizes each modality into visual tokens through transposed attention, fuses the image and event patterns with a learnable score-based mechanism, and then refines depth at a single scale with GRU blocks. The approach reports gains on MVSEC and DSEC, notably a 37.9% accuracy improvement over the state of the art in MVSEC nighttime scenarios, which points to robustness in low-light conditions while keeping a balance between accuracy and efficiency.

Abstract

Event cameras can record scene dynamics with high temporal resolution, providing rich scene details for monocular depth estimation (MDE) even under low illumination. Existing complementary learning approaches for MDE therefore fuse intensity information from images with scene details from event data for better scene understanding. However, most methods fuse the two modalities directly at the pixel level, ignoring that the complementarity mainly affects high-level patterns that occupy only a few pixels. For example, event data is likely to complement the contours of scene objects. In this paper, we discretize the scene into a set of high-level patterns to explore this complementarity and propose a Pattern-based Complementary learning architecture for monocular Depth estimation (PCDepth). Concretely, PCDepth comprises two primary components: a complementary visual representation learning module, which discretizes the scene into high-level patterns and integrates complementary patterns across modalities, and a refined depth estimator, which performs scene reconstruction and depth prediction while maintaining an efficiency-accuracy balance. Through pattern-based complementary learning, PCDepth exploits both modalities and achieves more accurate predictions than existing methods, especially in challenging nighttime scenarios. Extensive experiments on the MVSEC and DSEC datasets verify the effectiveness of PCDepth. Remarkably, compared with the state of the art, PCDepth achieves a 37.9% improvement in accuracy in MVSEC nighttime scenarios.

Paper Structure

This paper contains 10 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Existing pixel-level fusion architecture vs. our pattern-level fusion architecture. (a) Mainstream works fuse the features of the two modalities directly at the pixel level, ignoring that the complementarity mainly affects patterns of scene objects, which occupy only a few pixels. (b) Instead, our proposed pattern-level fusion architecture emphasizes the high-level patterns of scene objects and conducts complementary learning on visual patterns, which achieves high-quality depth estimates.
  • Figure 2: The overall architecture of PCDepth. First, we use a general encoder-decoder pipeline to process event and image inputs. Encoded features from event and image modalities are derived from dual feature extractors. These two features are concatenated and sent to decoder layers to generate pixel embeddings. Second, the proposed complementary visual representation learning module discretizes features of both modalities through transposed attention and integrates complementary patterns into one set of visual tokens through score-based fusion. Finally, we reconstruct the scene to one single scale through cross-attention and exploit GRU blocks to refine depth estimates, maintaining an accuracy-efficiency balance.
  • Figure 3: Complementary visual representation learning. (Top) We use transposed attention to discretize both modalities into two sets of visual tokens. (Bottom) Each set of visual tokens is first enhanced through standard attention. Then the sum of the enhanced visual tokens is fed into MLPs and a Softmax operator to generate two score maps, one for each modality.
  • Figure 4: The refined depth estimator design. To maintain an accuracy-efficiency balance, we transform the multi-scale pixel embeddings into one scale and exploit GRU to refine the depth estimates.
  • Figure 5: Qualitative results on MVSEC. The reference image and event frame are shown in (a) and (b). (c) - (f) show results from RAMNet, the baseline, our PCDepth, and the ground truth, respectively. Improvements are highlighted by red boxes.
  • ...and 2 more figures
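
The tokenization and score-based fusion described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: transposed attention is taken here to mean a softmax over the pixel axis that pools N pixel features into K visual tokens, the "MLPs" are replaced by a single linear layer, and the two score maps are assumed to weight the enhanced tokens in a convex combination. All function names, weight matrices (`w_assign`, `w_score`), and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def tokenize(features, w_assign):
    """Pool N pixel features (N, C) into K visual tokens (K, C).

    Assumption: 'transposed attention' normalises over pixels (axis 0),
    so each token is a weighted average of pixel features.
    """
    logits = features @ w_assign        # (N, K) pixel-to-token affinities
    attn = softmax(logits, axis=0)      # each column sums to 1 over pixels
    return attn.T @ features            # (K, C)


def self_attention(tokens):
    """Standard scaled dot-product self-attention without learned projections."""
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)
    return attn @ tokens


def score_fusion(img_tokens, evt_tokens, w_score):
    """Fuse two token sets with per-token scores.

    Assumption: a single linear layer (w_score, shape (C, 2)) stands in
    for the MLPs, and the fused token is a score-weighted sum.
    """
    img_enh = self_attention(img_tokens)
    evt_enh = self_attention(evt_tokens)
    summed = img_enh + evt_enh                    # (K, C)
    scores = softmax(summed @ w_score, axis=-1)   # (K, 2), rows sum to 1
    fused = scores[:, :1] * img_enh + scores[:, 1:] * evt_enh
    return fused, scores


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
N, C, K = 64, 8, 4
img_feat = rng.normal(size=(N, C))
evt_feat = rng.normal(size=(N, C))
w_assign = rng.normal(size=(C, K))
w_score = rng.normal(size=(C, 2))

img_tok = tokenize(img_feat, w_assign)
evt_tok = tokenize(evt_feat, w_assign)
fused, scores = score_fusion(img_tok, evt_tok, w_score)
print(fused.shape, scores.shape)
```

Because the scores for each token sum to one, a token whose event-side score is close to 1 is taken almost entirely from the event modality, which is one way to read the claim that event data complements object contours while the image supplies the remaining patterns.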