UNCOVER: Unknown Class Object Detection for Autonomous Vehicles in Real-time

Lars Schmarje, Kaspar Sakman, Reinhard Koch, Dan Zhang

TL;DR

Autonomous driving requires recognizing unknown objects in open-world scenes. The paper introduces UNCOVER, a real-time detector that adds an explicit OOD class and an occupancy-based objectness head, trained with Mosaic+ augmentation from diverse domains to improve generalization while preserving known-class accuracy. A depth-based post-hoc filter further reduces false positives, leveraging geometric cues when depth maps are available. Across Cityscapes, BDD100k, Fishyscapes, and related benchmarks, UNCOVER yields up to 25% improvements in unknown-object recall and 18.4% reductions in false positives, with only a modest impact on runtime, demonstrating practical benefits for safer autonomous driving.
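
The TL;DR pairs an explicit OOD class with Mosaic+ augmentation, which mixes in images from other domains. The Python sketch below covers only the annotation side of that pairing, and it rests on several simplifying assumptions. The known-class list `KNOWN_CLASSES` is hypothetical. Every label from the AD training classes keeps its index, while any other label is mapped to a single OOD index appended after them. The four source images are simply resized into a fixed 2x2 grid, whereas standard Mosaic also uses a random center and random crops. This is an illustration, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical AD training classes; the paper's actual class list may differ.
KNOWN_CLASSES = ["person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle"]
# The extra OOD class is appended after all known classes.
OOD_INDEX = len(KNOWN_CLASSES)


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    label: int


def remap_label(name: str) -> int:
    """Known AD classes keep their index; every other class becomes OOD."""
    return KNOWN_CLASSES.index(name) if name in KNOWN_CLASSES else OOD_INDEX


def mosaic_annotations(samples, out_size: float) -> list[Box]:
    """Merge the boxes of four samples into one 2x2 mosaic of side out_size.

    samples: four tuples (width, height, annotations), where annotations is a
    list of (x1, y1, x2, y2, class_name) in the source image's pixel frame.
    Simplification: each source image is resized to fill one quadrant exactly.
    """
    if len(samples) != 4:
        raise ValueError("Mosaic expects exactly four samples")
    half = out_size / 2
    offsets = [(0.0, 0.0), (half, 0.0), (0.0, half), (half, half)]
    merged = []
    for (width, height, annotations), (ox, oy) in zip(samples, offsets):
        sx, sy = half / width, half / height  # per-axis resize factors
        for x1, y1, x2, y2, name in annotations:
            merged.append(Box(x1 * sx + ox, y1 * sy + oy,
                              x2 * sx + ox, y2 * sy + oy,
                              remap_label(name)))
    return merged
```

Take a 200-pixel mosaic that includes a 200x100 image from a non-AD domain with one `"moose"` box at (100, 0, 200, 100). That box lands at (150, 0, 200, 100) in the top-right quadrant with label `OOD_INDEX`. Boxes of known classes keep their original indices.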

Abstract

Autonomous driving (AD) operates in open-world scenarios, where encountering unknown objects is inevitable. However, standard object detectors trained on a limited number of base classes tend to ignore any unknown objects, posing potential risks on the road. To address this, it is important to learn a generic, rather than class-specific, objectness from the objects seen during training. We therefore introduce an occupancy prediction alongside bounding box regression. It learns to score objectness as the fraction of the predicted box area that is occupied by actual objects. To improve its generalizability, we increase object diversity by exploiting data from other domains via Mosaic and Mixup augmentation. Objects outside the AD training classes are assigned to a newly added out-of-distribution (OOD) class. Our solution, UNCOVER (UNknown Class Object detection for autonomous VEhicles in Real-time), excels at both real-time detection and high recall of unknown objects on challenging AD benchmarks. To further attain very low false positive rates, particularly for close objects, we introduce a post-hoc filtering step that uses geometric cues extracted from the depth map, which is typically available within an AD system.
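
The abstract defines the occupancy target as the fraction of a predicted box's area that is occupied by actual objects. The Python sketch below follows one plausible reading of that definition, guided by the Figure 3 description that $Occ$ considers the intersection with all GT boxes. Under this reading, the target is the area of the predicted box covered by the union of all GT boxes, divided by the predicted box area. The union is computed exactly via coordinate compression. For contrast, the sketch also includes the best-match IoU target from Figure 3 (b). The `(x1, y1, x2, y2)` box format and all function names are assumptions, not the paper's code.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed x2 > x1, y2 > y1


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersect(a: Box, b: Box) -> Box:
    # May be degenerate (negative width/height); area() then returns 0.
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def union_area(boxes: list[Box]) -> float:
    """Exact area of a union of axis-aligned boxes via coordinate compression."""
    boxes = [b for b in boxes if area(b) > 0]
    xs = sorted({x for b in boxes for x in (b[0], b[2])})
    ys = sorted({y for b in boxes for y in (b[1], b[3])})
    total = 0.0
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            cx, cy = (xa + xb) / 2, (ya + yb) / 2  # centre of this grid cell
            if any(b[0] < cx < b[2] and b[1] < cy < b[3] for b in boxes):
                total += (xb - xa) * (yb - ya)
    return total


def occupancy_target(pred: Box, gts: list[Box]) -> float:
    """Class-agnostic: fraction of the predicted box covered by any GT box."""
    covered = union_area([intersect(pred, g) for g in gts])
    return covered / area(pred)


def iou(a: Box, b: Box) -> float:
    inter = area(intersect(a, b))
    return inter / (area(a) + area(b) - inter)


def best_iou_target(pred: Box, gts: list[Box]) -> float:
    """IoU with the single best-matching GT box, for comparison."""
    return max((iou(pred, g) for g in gts), default=0.0)
```

Consider a predicted box (0, 0, 10, 10) that straddles the GT boxes (0, 0, 5, 10) and (5, 0, 10, 5). For this box, `occupancy_target` returns 0.75, while `best_iou_target` returns only 0.5. Because no single match is required, a box that poorly localizes any one object can still receive a high objectness target.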

Paper Structure

This paper contains 38 sections, 2 equations, 19 figures, 8 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Visualization of occupancy. Here, UNCOVER was trained on Cityscapes cityscapes. Colors from blue through yellow to red indicate low, medium, and high occupancy. (a) UNCOVER predicts the highest occupancy on the known object classes from Cityscapes cityscapes, e.g., vehicles and persons. (b) The boxes on the ground from Fishyscapes Blum2021Fishy, which are OOD objects, also receive relatively high occupancy scores compared to the background (blue). (c) Occupancy also responds to occluded objects, here both vehicles and an OOD object (a trailer) from Anomaly Chan2021SMIYC.
  • Figure 2: UNCOVER. In phase I (training), we enable unknown object detection by adding OOD data (via Mosaic+), one extra class for OOD classification (dark blue), and one regression output (yellow) that predicts the occupancy of each detection. In phase II, UNCOVER exploits the occupancy prediction to improve the recall of unknown objects, in addition to detecting the OOD class via the classification head. If depth information, commonly available in AD systems, is present, UNCOVER applies an interpretable filtering step to reduce false positive detections (phase III). Note that, in addition to Mosaic+, we also use Mixup, which is omitted from the diagram for simplicity.
  • Figure 3: We compare three objectness measures given ground truth (GT) bounding boxes (blue) and a predicted bounding box (dashed, yellow). The scores shown are the optimization targets, not the actual outputs. (a) For $Obj$ Ge2021yolox, the score is one if the prediction is positively matched to a GT box and zero otherwise, as in the presented case. (b) For $IoU$ kim2021oln, the score is the highest IoU achieved with any of the GT boxes. (c) Our proposed $Occ$ considers the intersection with all GT boxes and thus, unlike (a), does not rely on a valid matching. Hence, even when localization is difficult, the occupancy by one or more objects is easily determined. Moreover, $Obj$ and $IoU$ may require matching classes, whereas $Occ$ is class-agnostic, which allows better generalization to unknown objects.
  • Figure 4: Examples where visual cues are not robust, yielding false positive detections.
  • Figure 5: Geometric cues from depth. Objects with a geometric shape, such as the bird, can be detected via the change of depth within their area. A road marking produces no such depth change. Depth can therefore be exploited to filter out non-objects at near range.
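
Figure 5 motivates the depth-based filter: a real object changes the depth inside its box, whereas a flat road marking does not. The Python sketch below shows one hypothetical way to turn this into a rule. For each row of a near-range box, it compares every pixel with the median depth of the pixels just left and right of the box in that row, on the assumption that depth on a flat road depends mainly on the image row. It keeps the detection only if enough pixels deviate. The parameters `near_range`, `min_change`, `min_fraction`, and `margin`, and the function names, are illustrative assumptions, and the paper's actual filter may use different geometric cues.

```python
from statistics import median

DepthMap = list[list[float]]  # depth[y][x] in metres
Box = tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coords, x2 and y2 exclusive


def changed_fraction(depth: DepthMap, box: Box, margin: int = 2,
                     min_change: float = 0.3) -> float:
    """Fraction of box pixels whose depth differs from same-row neighbours."""
    x1, y1, x2, y2 = box
    width = len(depth[0])
    changed = total = 0
    for y in range(y1, y2):
        row = depth[y]
        neighbours = row[max(0, x1 - margin):x1] + row[x2:min(width, x2 + margin)]
        if not neighbours:
            continue  # box spans the full image width; no reference for this row
        ref = median(neighbours)
        for x in range(x1, x2):
            total += 1
            if abs(row[x] - ref) > min_change:
                changed += 1
    return changed / total if total else 0.0


def keep_detection(depth: DepthMap, box: Box, near_range: float = 20.0,
                   min_fraction: float = 0.2) -> bool:
    """Drop near-range boxes whose depth does not stand out from their surroundings."""
    x1, y1, x2, y2 = box
    box_depth = median([depth[y][x] for y in range(y1, y2) for x in range(x1, x2)])
    if box_depth > near_range:
        return True  # the filter only targets close detections
    return changed_fraction(depth, box) >= min_fraction
```

Consider a synthetic flat road whose depth decreases row by row. A near box over an unchanged region has no depth change and is dropped. The same box is kept once the pixels inside it are set to a constant 10 m. Boxes whose median depth exceeds `near_range` are always kept, which mirrors the paper's focus on close objects.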