Run-time Introspection of 2D Object Detection in Automated Driving Systems Using Learning Representations

Hakan Yekta Yatbaz; Mehrdad Dianati; Konstantinos Koufos; Roger Woodman

TL;DR

A novel introspection solution is introduced, which operates at the frame level for DNN-based 2D object detection and leverages neural network activation patterns of the object detector's backbone using several different modes.

Abstract

Reliable detection of various objects and road users in the surrounding environment is crucial for the safe operation of automated driving systems (ADS). Despite recent progress in developing highly accurate object detectors based on Deep Neural Networks (DNNs), these detectors remain prone to detection errors, which can lead to fatal consequences in safety-critical applications such as ADS. An effective remedy is to equip the system with run-time monitoring, known as introspection in the context of autonomous systems. Motivated by this, we introduce a novel introspection solution, which operates at the frame level for DNN-based 2D object detection and leverages neural network activation patterns. The proposed approach pre-processes the neural activation patterns of the object detector's backbone using several different modes. To provide an extensive and fair comparative analysis, we also adapt and implement several state-of-the-art (SOTA) introspection mechanisms for error detection in 2D object detection, using one-stage and two-stage object detectors evaluated on the KITTI and BDD datasets. We compare the performance of the proposed solution in terms of error detection, adaptability to dataset shift, and computational and memory resource requirements. Our performance evaluation shows that the proposed introspection solution outperforms SOTA methods, achieving an absolute reduction in the missed error ratio of 9% to 17% on the BDD dataset.

Paper Structure (22 sections, 2 equations, 7 figures, 11 tables)

Introduction
Related Work
Proposed Introspection Model
Performance Evaluations
Object Detectors
Driving Datasets
Adapted SOTA Introspection Mechanisms
Statistical Features (SF)
Cascaded Learned Features (CLF)
Handcrafted Image & Model Features (HIMF)
Performance Metrics
Comparative Performance Evaluation between Different Pre-processing Modes
Comparative Performance Evaluation between Adapted Mechanisms
Detection performance
Cross-dataset performance
...and 7 more sections

Figures (7)

Figure 1: Actor-critic architecture for introspecting DNN-based ADS perception: Introspection can monitor the perception input, intermediate model outputs, or the final output of the main system (or combinations of them). In case of an error, it should provide an alert to take further action such as handover or a minimum risk manoeuvre [koopman2016challenges]. Frame-level rather than object-level introspection is within the scope of this paper.
Figure 2: Four-stage framework for the comparative analysis of introspection models: (1) training an object detection model specifically for driving scenarios, diverging from generic pre-training datasets such as COCO and Pascal VOC; (2) generating an error dataset that associates the features and labels for introspection; (3) training the introspection system using the error dataset from the validation set; and (4) evaluating the introspection system's performance using the error dataset from the test set, with the corresponding feature-label pairs. The dotted lines in the top right figure indicate that the input image and the output of the object detector might not be used for feature extraction by some of the introspection models (details will be provided in the description of each model).
Figure 3: Example visualisations of activation shaping techniques with a selected scaling parameter $p=80\%$. 'Original' depicts unaltered neural activation maps from the backbone network. 'ASH-P' retains the original activation scale without modification, simply pruning 80% of the 'Original' map, ensuring a direct comparison, see Eq. (1). 'ASH-S' shows scaled activations after pruning, with the colour scale adjusted to reflect redistributed activation intensities. 'ASH-B' represents binarised activations, with the scale indicating binary states of activation, diverging from the continuous values in 'Original', 'ASH-S' and 'ASH-P'. Each mode employs a different processing strategy, resulting in distinct scaling values, despite the similar visual patterns.
Figure 4: Architecture of the proposed mechanism at run-time: The top multi-coloured section illustrates the commonalities between the object detectors, including the backbone and Feature Pyramid Network (FPN) components. The distinction emerges post-FPN: the top line highlights Faster-RCNN's utilisation of a Region Proposal Network (RPN) and Region of Interest (RoI) alignment for feature map standardisation before detection, while the bottom line emphasises FCOS's direct feeding of multi-scale feature maps to its detection block. Despite these differences, both methods employ a backbone structure, which our introspection mechanism needs to access in order to extract neural activation patterns. These patterns are then shaped before insertion into the introspection model for identifying errors. Finally, with reference to Fig. 2, ResNet50 is fine-tuned during Stage 1, while ResNet18 and the fully-connected network are trained during Stage 3 of the unified four-stage framework for the training and testing of metric-based introspection models.
Figure 5: Illustration of the feature extraction process from [rahmanper]. The 3D activation maps undergo global pooling operations (mean, max, and standard deviation) across their height and width. The resulting 1D vectors are concatenated to form a unified column vector, which serves as the learning representation termed Statistical Features (SF).
...and 2 more figures
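
The three shaping modes named in the Figure 3 caption can be illustrated with a minimal sketch. Eq. (1) is not reproduced on this page, so the code below assumes the commonly described activation-shaping formulation: prune activations below the $p$-th percentile (ASH-P), then either rescale the survivors by $\exp(s_1/s_2)$ (ASH-S) or set them all to $s_1/k$ (ASH-B), where $s_1$ and $s_2$ are the activation sums before and after pruning and $k$ is the number of surviving entries. The function names, the flattened-list input, and non-negative (post-ReLU) activations are assumptions made for illustration, not the authors' implementation.

```python
import math


def _keep_mask(values, p):
    """Mark the entries that survive pruning of roughly the lowest p percent."""
    ordered = sorted(values)
    n_pruned = int(len(ordered) * p / 100.0)
    if n_pruned >= len(ordered):
        return [False] * len(values)
    threshold = ordered[n_pruned]
    # Ties at the threshold are kept, so slightly fewer entries may be pruned.
    return [v >= threshold for v in values]


def ash_p(values, p):
    """ASH-P (assumed form): zero the pruned entries, leave the rest untouched."""
    mask = _keep_mask(values, p)
    return [v if keep else 0.0 for v, keep in zip(values, mask)]


def ash_s(values, p):
    """ASH-S (assumed form): prune, then scale survivors by exp(s1 / s2)."""
    pruned = ash_p(values, p)
    s1, s2 = sum(values), sum(pruned)
    if s2 == 0:
        return pruned
    scale = math.exp(s1 / s2)
    return [v * scale for v in pruned]


def ash_b(values, p):
    """ASH-B (assumed form): prune, then set every survivor to s1 / k."""
    mask = _keep_mask(values, p)
    k = sum(mask)
    if k == 0:
        return [0.0] * len(values)
    level = sum(values) / k
    return [level if keep else 0.0 for keep in mask]


# Hypothetical flattened activation map with ten entries and p = 80.
activations = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
pruned = ash_p(activations, 80)
scaled = ash_s(activations, 80)
binarised = ash_b(activations, 80)
```

In this toy example all three modes keep the same two entries and differ only in the values assigned to them, which matches the caption's remark that the visual patterns are similar while the scales differ.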
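
The pooling described in the Figure 5 caption (global mean, max, and standard deviation over height and width, followed by concatenation) can be sketched as follows. This is a minimal pure-Python illustration, not the adapted baseline's code: the function name `statistical_features`, the nested-list layout (channels, then rows, then columns), and the use of the population standard deviation are assumptions.

```python
import statistics


def statistical_features(activation):
    """Pool a C x H x W activation map into a vector of length 3 * C.

    `activation` is a list of C channels, each an H x W nested list.
    The output is [mean per channel, max per channel, std per channel].
    """
    means, maxes, stds = [], [], []
    for channel in activation:
        # Collapse the spatial dimensions (height and width) of this channel.
        flat = [value for row in channel for value in row]
        means.append(statistics.fmean(flat))
        maxes.append(max(flat))
        # Population standard deviation; the sample version is equally plausible.
        stds.append(statistics.pstdev(flat))
    return means + maxes + stds


# Hypothetical activation map with 2 channels of size 2 x 2.
example = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
features = statistical_features(example)
```

The resulting vector has a fixed length of three times the channel count regardless of the spatial resolution, which is what allows it to be fed to a fully-connected classifier.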
