LiDAR-BIND-T: Improved and Temporally Consistent Sensor Modality Translation and Fusion for Robotic Applications

Niels Balemans, Ali Anwar, Jan Steckel, Siegfried Mercelis

TL;DR

LiDAR-BIND-T tackles temporal inconsistency in LiDAR-BIND by introducing three temporal mechanisms: embedding similarity across consecutive frames, a motion-aligned transformation loss, and temporal windowing with a fusion module, together with architectural refinements to preserve spatial structure. The approach retains LiDAR-BIND’s modular fusion while significantly improving temporal stability and SLAM performance, validated with domain-specific metrics such as FVMD and a correlation-peak displacement measure. Experiments on radar-to-LiDAR and sonar-to-LiDAR translation demonstrate lower APE and better occupancy-map accuracy in Cartographer-based SLAM, especially under degraded optical conditions. The work provides a practical, modular pathway to robust, temporally consistent multi-modal fusion for real-time robotic perception.

Abstract

This paper extends LiDAR-BIND, a modular multi-modal fusion framework that binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space, with mechanisms that explicitly enforce temporal consistency. We introduce three contributions: (i) temporal embedding similarity that aligns consecutive latent representations, (ii) a motion-aligned transformation loss that matches displacement between predictions and ground-truth LiDAR, and (iii) windowed temporal fusion using a specialised temporal module. We further update the model architecture to better preserve spatial structure. Evaluations on radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial coherence, yielding lower absolute trajectory error and better occupancy map accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We also propose temporal quality metrics, based on the Fréchet Video Motion Distance (FVMD) and on a correlation-peak distance, that serve as practical indicators of downstream SLAM performance. The proposed temporal LiDAR-BIND, or LiDAR-BIND-T, maintains modular modality fusion while substantially enhancing temporal stability, resulting in improved robustness and performance for downstream SLAM.
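
To make contribution (i) more concrete, the following is a minimal sketch of a temporal embedding similarity term. The abstract does not give the exact formulation, so the use of cosine distance, the averaging over consecutive pairs, and the function names `cosine_similarity` and `temporal_similarity_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_similarity_loss(embeddings):
    """Mean cosine distance between consecutive embeddings.

    Assumed form: the loss is 0 when every embedding points in the same
    direction as its predecessor and grows as consecutive embeddings diverge.
    """
    if len(embeddings) < 2:
        return 0.0
    distances = [
        1.0 - cosine_similarity(prev, curr)
        for prev, curr in zip(embeddings, embeddings[1:])
    ]
    return sum(distances) / len(distances)


# Hypothetical latent vectors for three consecutive frames.
frames = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2], [0.8, 0.2, 0.3]]
print(temporal_similarity_loss(frames))
```

In a training setup, a term of this kind would typically be added to the translation loss with a weighting factor; that weight is a further assumption and is not stated in the text above.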

Paper Structure

This paper contains 16 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview of the improved LiDAR-BIND framework for enhancing temporal consistency in modality translation and fusion. The updated framework fuses the different modalities over multiple time-steps, allowing for more accurate and stable predictions over time.
  • Figure 2: Overview of the LiDAR-BIND framework for multi-modal sensor fusion [balemans_lidar-bind_2024]. The framework aligns embeddings from different sensing modalities into a shared latent space, allowing for accurate and robust data fusion.
  • Figure 3: LiDAR-BIND latent fusion mechanism. The framework aligns embeddings from different sensing modalities into a shared latent space, allowing for accurate and robust data fusion. Note that this figure exaggerates the angular orientation of the vectors for illustrative purposes. In normal conditions, these vectors will be encoded closer to each other, while anomalies can be detected by a significant deviation in the angular orientation.
  • Figure 4: Overview of the different embedding space configurations tested in our experiments. (1) The first configuration is defined using radar data, (2) in the second configuration, the latent space is defined by the LiDAR data (similar to LiDAR-BIND), and (3) the third configuration combines both LiDAR and radar modalities to define the latent space at the same time. With this test, we aim to investigate the impact of embedding space design on model performance and prediction consistency.
  • Figure 5: Visualisation of the embedding similarity loss. Consecutive embeddings are specifically trained to be close in embedding space, promoting consistency in predictions.
  • ...and 6 more figures
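
The Figure 3 caption notes that anomalies can be detected from a significant deviation in the angular orientation of the modality embeddings. A minimal sketch of such a check is given here, under stated assumptions: the LiDAR embedding is taken as the reference, the 30-degree threshold is an arbitrary placeholder, and the names `angle_between` and `flag_anomalous_modalities` are hypothetical rather than taken from the paper.

```python
import math


def angle_between(a, b):
    """Angle in degrees between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_angle))


def flag_anomalous_modalities(reference, others, max_angle_deg=30.0):
    """Flag each modality whose embedding deviates too far from the reference.

    `max_angle_deg` is an illustrative threshold, not a value from the paper.
    """
    return {
        name: angle_between(reference, embedding) > max_angle_deg
        for name, embedding in others.items()
    }


# Hypothetical 2-D embeddings; real latent vectors would be much larger.
lidar = [1.0, 0.0]
flags = flag_anomalous_modalities(
    lidar, {"radar": [0.9, 0.1], "sonar": [0.0, 1.0]}
)
print(flags)
```

With these example vectors, the radar embedding lies within the threshold of the reference while the sonar embedding is orthogonal to it, so only the sonar entry is flagged.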