DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks

Theresa Huber, Simon Schaefer, Stefan Leutenegger

TL;DR

DynamicGlue tackles the challenge of dynamic environments in geometric vision by extending graph-based keypoint matching with epipolar and temporal cues within a sparse cross-image graph. It combines self- and cross-attentional aggregation with a LightGlue-inspired match head, and uses a self-supervised pipeline to generate pseudo-ground-truth from stereo–IMU data, explicitly excluding moving-object keypoints. The method achieves strong static matching performance while significantly reducing matches on dynamic objects, improving downstream SLAM/VIO accuracy in dynamic scenes. This dynamic awareness, together with a lightweight, edge-feature-rich GNN, enables robust data association without relying on synthetic data or dense graphs, and demonstrates practical impact in real-world SLAM systems.

Abstract

The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. As in state-of-the-art feature-matching networks, we enhance keypoint representations by attentional aggregation over graph edges, but we augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme that extracts pseudo labels for image pairs in dynamic environments exclusively from unprocessed visual-inertial data. A series of experiments shows that our network excludes keypoints on moving objects more reliably than state-of-the-art feature-matching networks while still achieving similar results on conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.
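
To make the idea of an epipolar-informed, sparse cross-image graph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a known fundamental matrix `F` between the two images and keeps a cross edge only when a keypoint in image B lies close to the epipolar line of a keypoint in image A. The helper names (`epipolar_distance`, `cross_edges`) and the pixel threshold `max_dist` are hypothetical; the abstract does not specify how edges are selected or how the epipolar information is encoded.

```python
import math

def epipolar_distance(F, x_a, x_b):
    """Pixel distance from x_b (image B) to the epipolar line F @ x_a.

    F is a 3x3 fundamental matrix given as nested lists; x_a and x_b are
    (x, y) pixel coordinates. Hypothetical helper for illustration only.
    """
    xa_h = (x_a[0], x_a[1], 1.0)  # homogeneous coordinates
    line = [sum(F[i][k] * xa_h[k] for k in range(3)) for i in range(3)]
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        return math.inf  # degenerate line, treat as infinitely far away
    return abs(line[0] * x_b[0] + line[1] * x_b[1] + line[2]) / norm

def cross_edges(F, kps_a, kps_b, max_dist):
    """Keep only cross-image edges that are consistent with the epipolar geometry.

    Returns (index_a, index_b, distance) triples; the distance could serve
    as an edge feature. The thresholding rule is an assumption.
    """
    edges = []
    for i, x_a in enumerate(kps_a):
        for j, x_b in enumerate(kps_b):
            d = epipolar_distance(F, x_a, x_b)
            if d <= max_dist:
                edges.append((i, j, d))
    return edges

# Toy example: pure sideways camera motion, so epipolar lines are horizontal.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
kps_a = [(10.0, 5.0), (20.0, 30.0)]
kps_b = [(14.0, 5.5), (3.0, 30.0), (8.0, 80.0)]
edges = cross_edges(F, kps_a, kps_b, max_dist=1.0)
print(edges)  # 2 of the 6 possible cross edges survive
```

In this toy setup a dense graph would contain six cross edges, while the epipolar check keeps two. A keypoint on an independently moving object generally violates the static-scene epipolar constraint, which is one intuition for why such a cue can help separate dynamic from static keypoints.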

Paper Structure (15 sections, 8 equations, 4 figures, 2 tables)

This paper contains 15 sections, 8 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: In this paper, we present DynamicGlue, a matching framework for dynamic scenes. Compared to state-of-the-art approaches, our framework can not only deal with large changes in appearance, such as viewpoint changes, but also differentiate dynamic from static parts of the scene. Matched keypoints are shown in yellow, and unmatched keypoints in red.
  • Figure 2: The network architecture comprises three parts: graph formation, attentional aggregation, and match assignment. In the first step, the graph is built from SuperPoint [DeTone2018] keypoints together with epipolar and temporal information. In the second step, an enhanced representation $\mathbf{n}$ of the initial descriptors $\mathbf{d}$ is computed by attentional aggregation over self- and cross-edges. The third step computes a partial assignment from the enhanced keypoint encodings.
  • Figure 3: Trajectories and drift statistics of our method vs. the baseline OKVIS2 [Leutenegger2022] on the test sequence '2020-06-12_10-10-57' of the TUM4Season dataset [wenzel2020fourseasons]. The relative motion errors for position and orientation are aggregated over different sub-trajectory lengths. Our method drifts significantly less than the baseline, reducing the azimuth error in particular.
  • Figure 4: Qualitative results of our framework (right) in various scenarios compared to LightGlue [Lindenberger2023] (left). Matched keypoints are shown in yellow, unmatched in red. Our method can distinguish dynamic from static objects across different environments and object types.
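
The attentional aggregation step named in the Figure 2 caption can be illustrated with a small sketch. This is a simplified stand-in under stated assumptions, not the network described in the paper: it uses the raw descriptors as queries, keys, and values (no learned projections), ignores edge features, and applies a plain residual update. The function name `aggregate` and the `(target, source)` edge convention are hypothetical.

```python
import math

def aggregate(features, edges):
    """One round of softmax-weighted message passing over a sparse edge list.

    features: list of equal-length float vectors, one per keypoint (node).
    edges: list of (target, source) index pairs; the target attends to the source.
    Returns residual-updated features. Nodes without incoming edges are unchanged.
    Assumption: no learned weights and no edge features, for illustration only.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)  # standard dot-product attention scaling

    # Group source nodes by the target node that attends to them.
    neighbours = {}
    for target, source in edges:
        neighbours.setdefault(target, []).append(source)

    updated = []
    for i, feat in enumerate(features):
        sources = neighbours.get(i, [])
        if not sources:
            updated.append(list(feat))
            continue
        scores = [scale * sum(a * b for a, b in zip(feat, features[s]))
                  for s in sources]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        message = [sum(w * features[s][d] for w, s in zip(weights, sources)) / total
                   for d in range(dim)]
        updated.append([a + b for a, b in zip(feat, message)])
    return updated

# Node 0 attends to nodes 1 and 2; node 2 has the more similar descriptor.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
edges = [(0, 1), (0, 2)]
updated = aggregate(features, edges)
```

Because each node only attends to the sources listed in `edges`, the cost of a round scales with the number of edges rather than with the square of the number of keypoints, which is where a reduced edge set pays off.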