YOLOPoint: Joint Keypoint and Object Detection

Anton Backhaus, Thorsten Luettel, Hans-Joachim Wuensche

TL;DR

This work addresses the need for robust GNSS-independent SLAM and visual odometry in camera-based autonomous systems by jointly detecting keypoints, predicting descriptors, and locating objects in a single forward pass. It introduces YOLOPoint, a multi-task network that fuses a SuperPoint-inspired keypoint/descriptor stream with YOLOv5 in a CSPDarknet backbone, offering four model variants for real-time performance. The paper demonstrates competitive keypoint repeatability and homography estimation on HPatches and strong visual odometry performance on KITTI, particularly when dynamic points are filtered using predicted object boxes. The approach enables efficient, multi-task perception suitable for real-time autonomous driving and SLAM, with plans to integrate into SLAM pipelines and further improve dynamic-object robustness.
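The box-based filtering of dynamic points mentioned above can be illustrated with a minimal sketch. It assumes keypoints arrive as `(x, y)` pixel coordinates and boxes as `(x1, y1, x2, y2)` corners for classes considered dynamic. The function name `filter_keypoints_by_boxes` and both input formats are hypothetical, chosen for illustration, and are not taken from the paper's code.

```python
import numpy as np


def filter_keypoints_by_boxes(keypoints, boxes):
    """Return the keypoints that lie outside every bounding box.

    keypoints: (N, 2) array of (x, y) pixel coordinates.
    boxes:     (M, 4) array of (x1, y1, x2, y2) corners; assumed to be
               boxes of potentially dynamic objects (hypothetical format).
    """
    keypoints = np.asarray(keypoints, dtype=float).reshape(-1, 2)
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    if len(boxes) == 0:
        return keypoints

    x = keypoints[:, 0:1]  # shape (N, 1), broadcasts against (M,)
    y = keypoints[:, 1:2]
    # inside[i, j] is True when keypoint i lies within box j.
    inside = (
        (x >= boxes[:, 0]) & (x <= boxes[:, 2])
        & (y >= boxes[:, 1]) & (y <= boxes[:, 3])
    )
    # Keep only keypoints that fall inside no box at all.
    return keypoints[~inside.any(axis=1)]


if __name__ == "__main__":
    kps = [[10, 10], [50, 50], [200, 120]]
    car_boxes = [[40, 40, 60, 60]]
    print(filter_keypoints_by_boxes(kps, car_boxes))
```

A filter like this discards every point inside a box, including static background visible within it, so it trades some usable keypoints for robustness against moving objects.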

Abstract

Intelligent vehicles of the future must be capable of understanding and navigating safely through their surroundings. Camera-based vehicle systems can use keypoints as well as objects as low- and high-level landmarks for GNSS-independent SLAM and visual odometry. To this end we propose YOLOPoint, a convolutional neural network model that simultaneously detects keypoints and objects in an image by combining YOLOv5 and SuperPoint to create a single forward-pass network that is both real-time capable and accurate. By using a shared backbone and a light-weight network structure, YOLOPoint is able to perform competitively on both the HPatches and KITTI benchmarks.

Paper Structure (11 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Example output of YOLOPointM on a KITTI scene with keypoint tracks from 3 frames and object bounding boxes.
  • Figure 2: Full model architecture, shown for YOLOPointS as an example. The two types of bottlenecks, C3 block (left) and a sequence of convolution, batch normalization and SiLU activation form the main parts of YOLO and, by extension, YOLOPoint. $k$: kernel size, $s$: stride, $p$: pad, $c$: output channels, $bn$: bottleneck, SPPF: fast spatial pyramid pooling [yolov5].
  • Figure 3: HPatches matches between two images with viewpoint change estimated with YOLOPointS. Matched keypoints are used to estimate the homography matrix describing the viewpoint change.
  • Figure 4: Translation and rotation RMSE over all KITTI sequences plotted against mean VO estimation time for YOLOPointL (YPL), M, S and N with filtered points as well as SuperPoint and classical methods for comparison (lower left is better). VO estimation was done with $376 \times 1241$ images, NVIDIA RTX A4000 and Intel Core i7-11700K.
  • Figure 5: Sequence 01: Driving next to a car on a highway. Top: All keypoints. Bottom: Keypoints on car removed via its bounding box.
  • ...and 1 more figures
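
The homography estimation from matched keypoints described in the Figure 3 caption can be sketched with a plain direct linear transform (DLT) in NumPy. This is an illustrative sketch only: the estimator used in the paper is not stated here, and a practical pipeline would normally normalize coordinates and wrap the solve in a robust scheme such as RANSAC to reject false matches. The function names `estimate_homography_dlt` and `apply_homography` are hypothetical.

```python
import numpy as np


def estimate_homography_dlt(src, dst):
    """Estimate H (3x3) such that dst ~ H @ src, from >= 4 point matches.

    src, dst: (N, 2) arrays of matched (x, y) coordinates.
    Plain DLT without normalization or outlier rejection.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each match contributes two linear constraints on the 9 entries of H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    a = np.array(rows)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the scale so that H[2, 2] == 1


def apply_homography(h, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ h.T
    return homog[:, :2] / homog[:, 2:3]


if __name__ == "__main__":
    h_true = np.array([[1.0, 0.1, 5.0], [0.05, 1.2, -3.0], [1e-4, 2e-4, 1.0]])
    src = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 30]], dtype=float)
    dst = apply_homography(h_true, src)
    print(np.round(estimate_homography_dlt(src, dst), 4))
```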