Identification of Fine-grained Systematic Errors via Controlled Scene Generation

Valentyn Boreiko; Matthias Hein; Jan Hendrik Metzen

TL;DR

BEV2EGO is a pipeline for controllable synthetic scene generation, designed to identify fine-grained systematic errors in object detectors used in autonomous driving. It combines BEV-to-EGO projection with a diffusion-based outpainting framework (ControlNet+Inpainting) to generate realistic multi-object street scenes under explicit attribute control. The paper introduces a Mean Median Score (MMS) metric and a 1200-scene benchmark that reveal detector weaknesses under rare appearance and occlusion scenarios, and it measures the Sim2Real gap, reporting reasonable transfer to real images. The work highlights the importance of synthetic, controllable test data for auditing detectors and suggests directions for reducing systematic errors and extending BEV2EGO to more object categories.

Abstract

Many safety-critical applications, especially in autonomous driving, require reliable object detectors. A method that searches for and identifies potential failures and systematic errors before these detectors are deployed can support them effectively. Systematic errors are characterized by combinations of attributes such as object location, scale, orientation, and color, as well as the composition of their respective backgrounds. Identifying them requires more than real images from a test set, because such images do not cover very rare but possible combinations of attributes. To overcome this limitation, we propose a pipeline for generating realistic synthetic scenes with fine-grained control, allowing the creation of complex scenes with multiple objects. Our approach, BEV2EGO, allows for a realistic generation of the complete scene with road-contingent control that maps 2D bird's-eye view (BEV) scene configurations to a first-person view (EGO). In addition, we propose a benchmark for controlled scene generation to select the most appropriate generative outpainting model for BEV2EGO. We further use it to perform a systematic analysis of multiple state-of-the-art object detection models and discover differences between them.
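The BEV-to-EGO mapping mentioned in the abstract can be illustrated with a standard pinhole projection. The following is a minimal sketch, not the authors' implementation: the function names `camera_matrix` and `bev_to_ego`, the focal length, the principal point, and the camera height are all hypothetical values chosen for illustration, and the camera is assumed to sit at the origin looking along +z with the y axis pointing down.

```python
def camera_matrix(focal_px, cx, cy):
    """Return a 3x4 pinhole matrix P = K [I | 0] (assumed setup, not from the paper)."""
    return [
        [focal_px, 0.0, cx, 0.0],
        [0.0, focal_px, cy, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ]


def bev_to_ego(P, x, z, cam_height):
    """Project a ground-plane BEV point (x, z) to pixel coordinates (u, v)."""
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # The y axis points down, so the ground plane lies at y = +cam_height.
    point = [x, cam_height, z, 1.0]
    # Multiply P by the homogeneous point, then divide by the last coordinate.
    u_h, v_h, w = (sum(p * c for p, c in zip(row, point)) for row in P)
    return u_h / w, v_h / w


# Hypothetical 512x512 image with the principal point at the center.
P = camera_matrix(focal_px=500.0, cx=256.0, cy=256.0)
u, v = bev_to_ego(P, x=2.0, z=10.0, cam_height=1.5)
u_far, v_far = bev_to_ego(P, x=2.0, z=20.0, cam_height=1.5)
```

In this sketch, moving a ground point farther away (larger z) pulls its projection toward the horizon row `cy`, which is the behavior one expects when placing objects at different depths in the EGO view.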

Paper Structure (23 sections, 2 equations, 29 figures, 5 tables)

Introduction
Related Works
Method
BEV2EGO
Generative Outpainting Models
Experiments
Analysis of the Generative Outpainting Models
Systematic Errors in Object Detectors
Evaluation of Sim2Real gap
Conclusion and Limitations
Appendix - Overview
Details about BEV2EGO
Questions used in TIFA evaluation
Details about the Mean Median Score (MMS)
Extended analysis of systematic errors in object detectors
...and 8 more sections

Figures (29)

Figure 1: The BEV2EGO Method for Controlled Synthesis of Realistic Scenes. Our BEV2EGO method takes images generated with SCROD \cite{boreiko2023SCROD} as input and consists of i) using the camera matrix $\mathbf{P}$, as described in Section \ref{sec:BEV_method}, to transform a bird's-eye view (BEV) of any scene into a first-person view (EGO); ii) computing the correct rotation angles under which a car is visible for a translated object (see Fig. \ref{fig:BEV2EGO_base_2D_birds_eye_view} for the discussion on why using the original angle is not appropriate); iii) conditioning with our approach towards realistic outpainting using ControlNet+Inpainting \cite{von-platen-etal-2022-diffusers}, as described in Section \ref{sec:generative_outpaintings_comparison}.
Figure 2: BEV2EGO enables fine-grained analysis of object detectors. The two left images illustrate the projection from a BEV scene to an EGO perspective according to the pinhole camera model. The middle image displays the conditioning used for our outpainting approach with ControlNet+Inpainting \cite{von-platen-etal-2022-diffusers}. The two right images show the outpainted version of the camera view as described in Section \ref{sec:BEV_method}. Occlusion can significantly degrade the performance of object detectors, as shown here for YOLOv5n (refer to Table \ref{tab:reproduced_AP} and Section \ref{sec:systematic_errors} for more details): while in the second image from the right, the probability of the class "car" for the partially occluded blue car is 32%, this probability drops to 0% in the rightmost image with slightly increased occlusion. The resolution here is 512x512, and the prompt used is "cars are driving in a forest, high resolution, high definition, high quality."
Figure 3: Computing rotation angles from the camera's reference system to the car's reference system. On the left, the object is centered, and thus the rotation angle $\alpha = 45^{\circ}$ around its axis is identical to the angle $\hat{\alpha}$ from which it is visible. We define $\hat{\alpha}$ as the angle between the tangent to the circle around the camera's pinhole at the center of the car $(m_x, m_z)$ and the line from the camera's pinhole to $(m_x, m_z)$. On the right, the object is shifted to the right, resulting in $\hat{\alpha} = 90^{\circ}$, while $\alpha$ remains $45^{\circ}$.
Figure 4: Our suggested "CN+Inpaint" outpainting method that combines ControlNet (CN) with an Inpainting model \cite{von-platen-etal-2022-diffusers} outperforms alternative outpainting approaches. We evaluate 120 synthetic scenes in which two cars are randomly positioned on streets. For each scene, we generate nine images with our BEV2EGO method. Examples of these images, which support the quantitative evaluation, are in Fig. \ref{fig:outpainting_models_qualitative_comparison}. SAM IoU measures the degree to which a method preserves the area of the masked cars after the outpainting. TIFA \cite{hu2023tifa} measures the fine-grained alignment between the questions about the quality of the image and the generated images after outpainting. We evaluate it on the same set of questions, which we put in Appendix \ref{app:questions_TIFA}. MS SSIM \cite{msssim} and the $\ell_2$ norm (of the difference) measure the degree to which a method preserves the area inside of the masked cars after the outpainting.
Figure 5: Limitations of generative outpainting methods (Section \ref{sec:generative_outpaintings_comparison}). Inpainting and LoRA lack road segmentation control, leading to unrealistic outputs and lower TIFA scores, as seen in all the rows of the respective columns. Inpainting additionally enlarges main object areas, reducing SAM IoU scores. ControlNet, although incorporating road segmentation, fails to preserve the image within the mask, causing artifacts and changing the color of the object inside of the mask. These deficiencies are not present in the proposed ControlNet+Inpainting (detailed in Section \ref{sec:quantitative_analysis_generative}). Quantitative evaluation of these methods is in Fig. \ref{fig:outpainting_models_quantitative_comparison}.
...and 24 more figures
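
The SAM IoU criterion in the Figure 4 caption compares the car mask before outpainting with the car region found afterwards. As a rough illustration of the overlap measure only, the sketch below computes intersection-over-union for two binary masks. It is an assumption-laden example rather than the paper's evaluation code: the segmentation step (SAM) is not reproduced, and `mask_iou`, `box_mask`, and the toy 8x8 masks are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks (nested lists)."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += bool(a) and bool(b)
            union += bool(a) or bool(b)
    # Two empty masks are treated as a perfect match (a convention chosen here).
    return inter / union if union else 1.0


def box_mask(size, top, left, bottom, right):
    """Hypothetical helper: a size x size mask that is True inside a rectangle."""
    return [
        [top <= r < bottom and left <= c < right for c in range(size)]
        for r in range(size)
    ]


# Toy case: the object covers 16 pixels before and grows to 20 pixels after.
before = box_mask(8, 2, 2, 6, 6)
after = box_mask(8, 2, 2, 6, 7)
iou = mask_iou(before, after)
```

A method that enlarges the object, as the Figure 5 caption describes for plain Inpainting, lowers this score even though the original mask is fully contained in the new one.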
