Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model

Ludan Zhang; Xiaokang Ding; Yuqi Dai; Lei He; Keqiang Li

TL;DR

BEV-IFME addresses the opacity of end-to-end BEV perception with an Independent Functional Module Evaluation framework that maps Ground Truth (GT) and module feature maps into a shared semantic Representation Space (Re-Space) and measures their similarity. A two-stage Alignment AutoEncoder, guided by GT encodings from pre-trained LLMs, produces feature representations; their cosine similarity to the GT representations gives a Similarity Score that correlates strongly with BEV metrics such as mAP and NDS (average correlation coefficient 0.9387). The approach enables independent evaluation and hierarchical optimization of functional modules, remains stable across module configurations, and can guide training adjustments. Validation on nuScenes-mini across eight module configurations shows that Similarity Scores track BEV performance, which supports practical use for improving development efficiency and interpretability in autonomous driving systems.

Abstract

End-to-end models are becoming the mainstream approach to autonomous driving perception. However, their internal mechanisms cannot be examined in detail, which lowers development efficiency and makes trust harder to establish. As an early effort on this problem, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that compares a module's feature maps against Ground Truth within a unified semantic Representation Space and quantifies their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework is feature map encoding and representation alignment, carried out by our proposed two-stage Alignment AutoEncoder, which preserves salient information and keeps the feature structure consistent. The Similarity Score, our metric for the training maturity of functional modules, shows a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, which supports the framework's reliability for assessment.
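The two quantities named above, a cosine-similarity Similarity Score and its correlation with BEV metrics, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `similarity_score` and `pearson`, the vector shapes, and all numbers in the toy data are assumptions chosen for illustration, and the correlation is assumed here to be a Pearson coefficient, which the text does not specify.

```python
import numpy as np


def similarity_score(feature_repr, gt_repr):
    """Cosine similarity between a feature representation and a GT representation.

    Both inputs are assumed to be 1-D vectors already projected into the
    shared representation space (hypothetical shapes, not from the paper).
    """
    f = np.asarray(feature_repr, dtype=float)
    g = np.asarray(gt_repr, dtype=float)
    denom = np.linalg.norm(f) * np.linalg.norm(g)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(f, g) / denom)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical toy data: a GT representation and feature representations
# from four made-up checkpoints that drift toward it during training.
gt = np.array([1.0, 0.0, 1.0, 0.0])
checkpoints = [
    np.array([0.2, 0.9, 0.1, 0.8]),
    np.array([0.5, 0.6, 0.4, 0.5]),
    np.array([0.8, 0.3, 0.7, 0.2]),
    np.array([0.9, 0.1, 1.0, 0.1]),
]
scores = [similarity_score(c, gt) for c in checkpoints]

# Made-up mAP values for the same checkpoints (illustrative only).
toy_map = [0.10, 0.18, 0.27, 0.33]
print("Similarity Scores:", [round(s, 4) for s in scores])
print("Correlation with toy mAP:", round(pearson(scores, toy_map), 4))
```

A score near 1 means the feature representation points in nearly the same direction as the GT representation; a high correlation across checkpoints is what would justify using the score as a proxy for downstream metrics.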

Paper Structure (18 sections, 7 equations, 4 figures, 2 tables)

INTRODUCTION
RELATED WORK
Modular Networks
Clip Text and Feature Map Encoding
Evaluation Methods
PROBLEM FORMULATION
METHOD
Overall Architecture
Ground Truth Encoder
Feature Map Encoder
Feature Map Quality Metric
EXPERIMENT
Dataset
Metric
Function Module Combination Settings for BEVFormer
...and 3 more sections

Figures (4)

Figure 1: Overview of the Independent Functional Module Evaluation Framework (BEV-IFME) on BEVFormer. By projecting the feature maps $\mathcal{F}_{img}$, $\mathcal{F}_{bev}$ and the ground truths $\mathcal{GT}_{2D}$, $\mathcal{GT}_{3D}$ into a shared semantic representation space (Re-Space) and measuring their Similarity Score, BEV-IFME assesses how accurately the feature maps capture scene details and how much of their information overlaps with the GT.
Figure 2: Two-Stage Alignment AutoEncoder Training Process. Feature maps from the 8 training phases serve as inputs. The first stage of AutoEncoder training is self-supervised: it extracts and reconstructs information so that the Feature Representation preserves as much of the original feature's information as possible. The second stage achieves structural alignment by aligning the Feature Representation with the GT Representation.
Figure 3: 2D Re-Space Similarity Scores and BEV Trends of 8 Module Configurations. In the 2D Re-Space, the average Similarity Score $\mathcal{S}$ between $\mathcal{F}_{img}$ and $\mathcal{GT}_{2D}$ is 0.8757 with SBERT encoding and 0.8807 with GPT-2 encoding. Across the module configurations, the feature map quality metric $\mathcal{S}$ rises consistently together with the BEV metrics mAP and NDS.
Figure 4: 3D Re-Space Similarity Scores and BEV Trends of 8 Module Configurations. In the 3D Re-Space, the mean Similarity Score $\mathcal{S}$ between $\mathcal{F}_{bev}$ and $\mathcal{GT}_{3D}$ is 0.9983 with SBERT encoding and 0.99996 with GPT-2 encoding, markedly higher than the corresponding values in the 2D Re-Space. The correlation between $\mathcal{S}$ and the metrics mAP and NDS is relatively weak here: $\mathcal{S}$ trends upward overall, with occasional fluctuations.
