YOLOPoint Joint Keypoint and Object Detection
Anton Backhaus, Thorsten Luettel, Hans-Joachim Wuensche
TL;DR
This work addresses the need for robust, GNSS-independent SLAM and visual odometry in camera-based autonomous systems by detecting keypoints, predicting descriptors, and locating objects in a single forward pass. It introduces YOLOPoint, a multi-task network that combines a SuperPoint-inspired keypoint/descriptor stream with YOLOv5 object detection on a shared CSPDarknet backbone, and it is offered in four model variants aimed at real-time operation. The paper reports competitive keypoint repeatability and homography estimation on HPatches and strong visual odometry performance on KITTI, particularly when keypoints on dynamic objects are filtered out using the predicted object boxes. The approach enables efficient multi-task perception suitable for real-time autonomous driving, and the authors plan to integrate it into SLAM pipelines and further improve robustness to dynamic objects.
Abstract
Intelligent vehicles of the future must be capable of understanding and navigating safely through their surroundings. Camera-based vehicle systems can use keypoints as well as objects as low- and high-level landmarks for GNSS-independent SLAM and visual odometry. To this end, we propose YOLOPoint, a convolutional neural network that simultaneously detects keypoints and objects in an image by combining YOLOv5 and SuperPoint into a single forward-pass network that is both real-time capable and accurate. Thanks to a shared backbone and a lightweight network structure, YOLOPoint performs competitively on both the HPatches and KITTI benchmarks.
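
The summary above notes that visual odometry benefits when points on dynamic objects are filtered using the predicted object boxes. The following is a minimal sketch of that idea, not the authors' implementation. The function name `static_keypoint_mask`, the `(x, y)` keypoint layout, and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, and every box is treated as a dynamic object, which a real pipeline would likely refine by object class.

```python
import numpy as np


def static_keypoint_mask(keypoints, boxes):
    """Return a boolean mask that is True for keypoints outside all boxes.

    keypoints: array-like of shape (N, 2) holding (x, y) pixel coordinates.
    boxes:     array-like of shape (M, 4) holding (x1, y1, x2, y2) corners.
    Both layouts are assumptions of this sketch.
    """
    keypoints = np.asarray(keypoints, dtype=float).reshape(-1, 2)
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)

    # With no detections, every keypoint is kept.
    if boxes.shape[0] == 0:
        return np.ones(keypoints.shape[0], dtype=bool)

    x = keypoints[:, 0:1]  # shape (N, 1), broadcasts against (M,)
    y = keypoints[:, 1:2]

    # inside[i, j] is True when keypoint i lies within box j (borders included).
    inside = (
        (x >= boxes[:, 0]) & (x <= boxes[:, 2])
        & (y >= boxes[:, 1]) & (y <= boxes[:, 3])
    )

    # Keep only keypoints that fall inside no box at all.
    return ~inside.any(axis=1)


# Hypothetical example: one detected box covering the image centre.
example_keypoints = [[10, 10], [50, 50], [90, 20]]
example_boxes = [[40, 40, 60, 60]]
example_mask = static_keypoint_mask(example_keypoints, example_boxes)
```

Here the second keypoint falls inside the box and would be discarded before pose estimation, while the other two are kept.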
