
EI-Nexus: Towards Unmediated and Flexible Inter-Modality Local Feature Extraction and Matching for Event-Image Data

Zhonghua Yi, Hao Shi, Qi Jiang, Kailun Yang, Ze Wang, Diyang Gu, Yufan Zhang, Kaiwei Wang

TL;DR

This work proposes EI-Nexus, an unmediated and flexible framework that integrates two modality-specific keypoint extractors and a feature matcher. By avoiding explicit modality transformation, it achieves better keypoint similarity and state-of-the-art results on the MVSEC-RPE and EC-RPE benchmarks.

Abstract

Event cameras offer high temporal resolution and high dynamic range, yet inter-modality local feature extraction and matching for event-image data has received limited research attention. We propose EI-Nexus, an unmediated and flexible framework that integrates two modality-specific keypoint extractors and a feature matcher. To achieve keypoint extraction across viewpoint and modality changes, we introduce Local Feature Distillation (LFD), which transfers the viewpoint consistency from a well-learned image extractor to the event extractor, ensuring robust feature correspondence. Furthermore, Context Aggregation (CA) yields a marked improvement in feature matching. We further establish the first two inter-modality feature matching benchmarks, MVSEC-RPE and EC-RPE, to assess relative pose estimation on event-image data. Our approach outperforms traditional methods that rely on explicit modal transformation, offering more direct and adaptable feature extraction and matching, and achieves better keypoint similarity and state-of-the-art results on the MVSEC-RPE and EC-RPE benchmarks. The source code and benchmarks will be made publicly available at https://github.com/ZhonghuaYi/EI-Nexus_official.

Paper Structure

This paper contains 31 sections, 21 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Pipeline comparison. (a) Traditional pipelines [muglikar2021calibrate, jiao2023lce] apply explicit modality transformation first, then utilize image-based extraction and matching models. (b) Our framework directly extracts keypoints from events and images and then applies feature matching, which is simpler and more powerful.
  • Figure 2: Differences between (a) tracking using events and the reference frame and (b) inter-modality matching: (a) Event-based feature tracking methods estimate the displacement $\mathbf{f}_j$ of the event patch $\mathbf{P}_j$ at $t_j$ relative to the original image patch $\mathbf{P}_0$ at $t_0$, both located at the same position. (b) Our inter-modality matching method separately extracts keypoints and descriptors and matches them using cross-modality descriptors, without assuming a predefined relationship between events and images.
  • Figure 3: Framework overview. Snowflake icons mark components whose parameters are not optimized during training. Our framework follows a detector-based architecture consisting of a local feature extraction stage and a feature matching stage. The event and the image are sent to separate extractors to obtain the corresponding score maps and descriptor maps. The same keypoint extraction procedure is then applied to both branches, resulting in two keypoint sets. The two keypoint sets are sent to the matcher, which estimates the assignment matrix. During training, the event extractor is first trained through Local Feature Distillation (LFD), and the matcher is then trained using the ground-truth assignment calculated from the depth map and relative pose. Every component in the framework is modular, showing the flexibility of our design.
  • Figure 4: Qualitative results of keypoint similarity. Keypoints that satisfy the Repeatability criterion with $\epsilon{=}3$ are shown in green in the three rightmost columns, while the rest are in red. The Repeatability score for each method is marked at the bottom of the image. The event-to-video methods suffer from artifacts or inconsistent dynamic range, resulting in low Repeatability.
  • Figure 5: Qualitative matching results on the MVSEC-RPE dataset. SuperPoint is employed as the image extractor, while the fine-tuned LightGlue is utilized to estimate the matches. Correct matches are indicated by green lines and mismatches by red lines.
  • ...and 4 more figures
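
The Repeatability criterion with $\epsilon{=}3$ mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes the common symmetric definition, in which a keypoint counts as repeated when the other keypoint set contains a point within $\epsilon$ pixels; the paper's exact formulation may differ. The function name `repeatability`, the use of `<=` at the threshold, and the sample coordinates are all hypothetical, and both keypoint sets are assumed to already share one pixel frame.

```python
import math


def repeatability(kpts_a, kpts_b, eps=3.0):
    """Fraction of keypoints with a counterpart within eps pixels in the other set.

    kpts_a, kpts_b: lists of (x, y) pixel coordinates in a common frame.
    Assumed symmetric definition, not necessarily the one used in the paper.
    """
    if not kpts_a or not kpts_b:
        return 0.0

    def count_repeated(src, dst):
        # Count source keypoints whose nearest destination keypoint is close enough.
        n = 0
        for (x, y) in src:
            nearest = min(math.hypot(x - u, y - v) for (u, v) in dst)
            if nearest <= eps:
                n += 1
        return n

    hits = count_repeated(kpts_a, kpts_b) + count_repeated(kpts_b, kpts_a)
    return hits / (len(kpts_a) + len(kpts_b))


# Hypothetical keypoints from an event extractor and an image extractor.
event_kpts = [(10.0, 10.0), (50.0, 40.0), (80.0, 20.0)]
image_kpts = [(11.0, 12.0), (52.0, 41.0), (200.0, 200.0)]
score = repeatability(event_kpts, image_kpts)
```

In this toy example, two keypoints in each set have a counterpart within 3 pixels, so four of the six keypoints are repeated. The brute-force nearest-neighbor search is quadratic in the number of keypoints; it is chosen here for readability rather than speed.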