EMatch: A Unified Framework for Event-based Optical Flow and Stereo Matching

Pengjie Zhang, Lin Zhu, Xiao Wang, Lizhi Wang, Wanxuan Lu, Hua Huang

TL;DR

This paper reformulates event-based flow estimation and stereo matching as a unified dense correspondence matching problem, so that a single model can solve both tasks by directly matching features in a shared representation space.

Abstract

Event cameras have shown promise in vision applications like optical flow estimation and stereo matching, with many specialized architectures leveraging the asynchronous and sparse nature of event data. However, existing works consider event data only within the confines of task-specific domains, overlooking how tasks across the temporal and spatial domains can reinforce each other. In this paper, we reformulate event-based flow estimation and stereo matching as a unified dense correspondence matching problem, enabling us to solve both tasks within a single model by directly matching features in a shared representation space. Specifically, our method utilizes a Temporal Recurrent Network to aggregate event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared feature similarities module that integrates knowledge from event streams via temporal or spatial interactions, our network performs optical flow estimation from temporal event segment inputs and stereo matching from spatial event segment inputs simultaneously. We demonstrate that our unified model inherently supports multi-task fusion and cross-task transfer. Without retraining for a specific task, our model can effectively handle both optical flow and stereo estimation, achieving state-of-the-art performance on both tasks.
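To make the "unified dense correspondence matching" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes feature maps are plain NumPy arrays of shape (H, W, C), uses a hard winner-take-all argmax over dot-product similarities, and assumes rectified stereo with non-negative disparity. The function names `similarity`, `match_flow`, and `match_stereo` are hypothetical. The only difference between the two tasks in this sketch is the search range: the whole target image for flow, the same row for stereo.

```python
import numpy as np

def similarity(f_ref, f_tgt):
    """Scaled dot-product similarity between all feature pairs.
    f_ref, f_tgt: (H, W, C) -> (H, W, H, W)."""
    c = f_ref.shape[-1]
    return np.einsum("ijc,klc->ijkl", f_ref, f_tgt) / np.sqrt(c)

def match_flow(f_ref, f_tgt):
    """2D matching: search the whole target image for each reference pixel.
    Returns integer flow of shape (H, W, 2), ordered as (dx, dy)."""
    h, w, _ = f_ref.shape
    sim = similarity(f_ref, f_tgt).reshape(h, w, h * w)
    best = sim.argmax(axis=-1)            # flat index of the best target pixel
    ty, tx = np.divmod(best, w)           # back to (row, column)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([tx - xs, ty - ys], axis=-1)

def match_stereo(f_left, f_right):
    """1D matching: search along the same row only (assumes rectified views).
    Returns integer disparity of shape (H, W), left x minus right x."""
    h, w, c = f_left.shape
    sim = np.einsum("ijc,ikc->ijk", f_left, f_right) / np.sqrt(c)  # (H, W, W)
    xs = np.arange(w)
    # Assumption: disparity is non-negative, so the right-view match
    # cannot lie to the right of the left-view pixel.
    valid = xs[None, :] <= xs[:, None]    # (W_left, W_right)
    sim = np.where(valid[None], sim, -np.inf)
    best = sim.argmax(axis=-1)
    return xs[None, :] - best

# Toy usage: one-hot features make every pixel uniquely identifiable.
feats = np.eye(48).reshape(6, 8, 48)
flow = match_flow(feats, np.roll(feats, shift=(1, 2), axis=(0, 1)))
disparity = match_stereo(feats, np.roll(feats, -3, axis=1))
```

In the toy usage, the target is a shifted copy of the reference, so away from the wrapped borders the recovered displacement equals the shift. A learned model would typically replace the hard argmax with a differentiable alternative; that choice is outside what the abstract specifies.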

Paper Structure (12 sections, 11 equations, 9 figures, 4 tables)


Figures (9)

  • Figure 1: Overview of our unified framework. Previous works considered optical flow estimation and stereo matching as two separate tasks and designed many frameworks in their respective pipelines. We reformulate these two tasks as a dense correspondence matching problem and design a novel unified framework with a shared representation space.
  • Figure 2: Illustration of the similarity between event-based optical flow estimation and stereo matching by correspondence matching. We accumulate events during a sampling time $dt$ from different times ($T, T+\Delta T$) or different viewpoints ($Left, Right$) to get reference and target event streams $E_{1}, E_{2}$. By correspondence matching, we can get the displacement $D$ (i.e., flow or disparity) between them. Note that the event streams need to be transformed into a high-dimensional domain through feature extraction $\mathcal{F}(\cdot)$.
  • Figure 3: Overall architecture of EMatch. It can be divided into four parts: 1) Feature Encoding. The Temporal Recurrent Network (TRN) encodes the reference and target event voxels $V_{1}, V_{2}$ to get initial features $F_{1}, F_{2}$. 2) Feature Enhancement. The Spatial Contextual Attention (SCA) enhances features $F_{1}, F_{2}$ to obtain dense feature maps $\hat{F_{1}}, \hat{F_{2}}$ for matching. 3) Correspondence Matching. The displacement $D$ (i.e., flow or disparity) is calculated by searching for the matching features with the highest similarity between the reference and target feature maps $\hat{F_{1}}, \hat{F_{2}}$. 4) Refinement. The displacement $D$ is further refined to finally get flow or disparity.
  • Figure 4: Detailed architecture of TRN. First, the event voxel $V$ is split into $K$ groups $\{V_{T_{i}}|i=0,...,K\}$ in chronological order. Then, they are fed into stacked ResBlock and ConvGRU layers recurrently to extract temporal features. Finally, we obtain multi-layer features $\{F^{l=i}_{T_{K}}|i=0,1,2,3\}$ as the result. We can use only the last-layer features $F^{l=3}_{T_{K}}$, or additionally use the other layers for multi-scale optimization.
  • Figure 5: Visualizing features of EMatch within different domains. We apply PCA on intermediate features both for flow and stereo (i.e. temporal domain and spatial domain) with single-task training and multi-task training. After TRN and SCA, we can obtain a high-dimensional feature map for dense correspondence matching, and through multi-task training, the features of flow and stereo can be unified in the same representation space.
  • ...and 4 more figures
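
The recurrent encoding described in the Figure 4 caption (split the event voxel into chronological groups, then update a hidden state group by group) can be sketched as follows. This is an illustration under loud assumptions, not the paper's TRN: the ResBlocks are omitted, the ConvGRU is replaced by a GRU with 1x1 kernels (a per-pixel linear map), weights are random rather than learned, and only a single feature layer is produced. The names `split_voxel`, `PixelGRU`, and `encode` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def split_voxel(voxel, k):
    """Split a (T, H, W) voxel grid into k equal chronological groups.
    np.split raises ValueError if T is not divisible by k."""
    return np.split(voxel, k, axis=0)

class PixelGRU:
    """GRU cell applied independently at each pixel (1x1-kernel stand-in
    for a ConvGRU; an assumption made to keep the sketch short)."""

    def __init__(self, in_ch, hid_ch, rng):
        shape = (in_ch + hid_ch, hid_ch)
        self.hid_ch = hid_ch
        self.wz = rng.normal(0.0, 0.1, shape)  # update gate weights
        self.wr = rng.normal(0.0, 0.1, shape)  # reset gate weights
        self.wh = rng.normal(0.0, 0.1, shape)  # candidate state weights

    def step(self, x, h):
        # x: (H, W, in_ch), h: (H, W, hid_ch)
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.wz)
        r = sigmoid(xh @ self.wr)
        cand = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.wh)
        return (1.0 - z) * h + z * cand

def encode(voxel, k, cell):
    """Feed the k temporal groups to the cell in chronological order and
    return the final hidden state as the feature map (H, W, hid_ch)."""
    _, height, width = voxel.shape
    h = np.zeros((height, width, cell.hid_ch))
    for group in split_voxel(voxel, k):
        x = np.moveaxis(group, 0, -1)  # time bins of the group become channels
        h = cell.step(x, h)
    return h

# Toy usage: 15 time bins split into K = 5 groups of 3 bins each.
rng = np.random.default_rng(0)
voxel = rng.standard_normal((15, 4, 5))
cell = PixelGRU(in_ch=3, hid_ch=8, rng=rng)
features = encode(voxel, k=5, cell=cell)
```

Because the same cell is applied to every group, the encoder handles the reference and target voxels identically, which is what allows the later matching stage to compare their features directly.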