Exploring Camera Encoder Designs for Autonomous Driving Perception

Barath Lakshmanan; Joshua Chen; Shiyi Lan; Maying Shen; Zhiding Yu; Jose M. Alvarez

TL;DR

This work addresses the need for AV-specific camera encoders by starting from ConvNeXt and progressively tailoring micro- and macro-architectures to AV data characteristics, including long-range detection and multi-camera BEV fusion. The approach combines hardware-conscious block design (DriveNeXt), selective attention, and carefully tuned macro-architecture decisions (block width, stage count, block count, and stage compute ratio), plus high-resolution input handling. The optimized encoder achieves a relative mAP improvement of $8.79\%$ over the vanilla baseline, with an additional $1.2\%$ gain from a hybrid architecture, and demonstrates scalable variants (Tiny/Small/Base/Large) suitable for online and offline deployment, reaching $79.2\%$ mAP from $72.8\%$ on the AV dataset. These results underscore the value of domain-specific encoder customization for AV perception and offer a practical blueprint for building high-performance, deployable camera encoders in industry settings.
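
As a quick sanity check on the figures quoted above, the sketch below computes the relative improvement implied by the two mAP endpoints. It assumes that 72.8 and 79.2 are the baseline and final mAP values behind the stated relative gain; the function name `relative_improvement` is illustrative and not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Endpoints quoted in the TL;DR (assumed to be the pair behind the relative gain).
baseline_map = 72.8
final_map = 79.2

gain = relative_improvement(baseline_map, final_map)
print(f"absolute gain: {final_map - baseline_map:.1f} mAP points")
print(f"relative gain: {gain:.2f}%")
```

Under that assumption, the 6.4-point absolute gain corresponds to the reported relative improvement of about 8.79%, so the two ways of stating the result agree.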

Abstract

The cornerstone of autonomous vehicles (AV) is a solid perception system, in which camera encoders play a crucial role. Existing works usually leverage pre-trained Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) designed for general vision tasks, such as image classification, segmentation, and 2D detection. Although those well-known architectures have achieved state-of-the-art accuracy in AV-related tasks, e.g., 3D object detection, there remains significant potential for improvement in network design due to the nuanced complexities of industrial-level AV datasets. Moreover, existing public AV benchmarks usually contain insufficient data, which might lead to inaccurate evaluation of those architectures. To reveal AV-specific model insights, we start from a standard general-purpose encoder, ConvNeXt, and progressively transform the design. We adjust different design parameters, including the width and depth of the model, stage compute ratio, attention mechanisms, and input resolution, supported by a systematic analysis of each modification. This customization yields an architecture optimized for AV camera encoding, achieving an 8.79% mAP improvement over the baseline. We believe our effort could become a useful cookbook of image encoders for AV and pave the way to the next-level drive system.

Paper Structure (21 sections, 8 figures, 1 table)

Introduction
Related Work
AI Models for Encoder
Dataset for Autonomous Driving
nuScenes
Waymo Open Dataset
Base architecture design
Experiments and Results
Experimental setup
Micro-architecture Design
Block design
Adding attention blocks
Macro-architecture Design
Changing block width
Changing number of stages
...and 6 more sections

Figures (8)

Figure 1: Obstacle 3D Detection Pipeline from Multi-Camera Input. An image encoder extracts relevant features from each input image. The transformation stage projects the 2D image features into a unified 3D space, typically a bird's-eye view (BEV) representation. BEV Encoder-Decoder further processes the 3D features to refine spatial relationships and contextual information. Finally, the prediction stage generates the final 3D obstacle predictions, including their locations, classes, and other relevant attributes.
Figure 2: Architecture design of the base model. b1, b2, b3, b4 denote the number of blocks per stage. A hybrid DriveNeXt block replaces the regular block in 3rd stage to realize hybrid architecture.
Figure 3: Block evolution from ConvNeXt to DriveNeXt and DriveNeXt-Hybrid.
Figure 4: Hybrid model ablation: Attention layers and positioning analysis.
Figure 5: Early CNN stages benefit most from additional blocks, while later stages see diminishing returns.
...and 3 more figures
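
The four-stage layout described in the Figure 2 caption can be sketched as a small configuration, shown here only to make the b1..b4 notation concrete. Everything specific in it is an assumption: the default block counts are the ConvNeXt-T depths used as placeholders (not the paper's tuned values), and `EncoderConfig`, `num_hybrid_blocks`, and the choice to place hybrid blocks at the end of the stage are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EncoderConfig:
    # Blocks per stage (b1, b2, b3, b4). Placeholder: ConvNeXt-T depths,
    # not the values tuned in the paper.
    blocks_per_stage: Tuple[int, int, int, int] = (3, 3, 9, 3)
    # 1-indexed stage that receives hybrid blocks (the caption names the 3rd stage).
    hybrid_stage: int = 3
    # Number of trailing blocks in that stage made hybrid (hypothetical knob).
    num_hybrid_blocks: int = 1


def stage_layout(cfg: EncoderConfig) -> List[List[str]]:
    """Return, for each stage, the list of block types in order."""
    layout: List[List[str]] = []
    for stage_idx, n_blocks in enumerate(cfg.blocks_per_stage, start=1):
        blocks = ["regular"] * n_blocks
        if stage_idx == cfg.hybrid_stage:
            k = min(cfg.num_hybrid_blocks, n_blocks)
            # Assumption: hybrid blocks sit at the end of the stage.
            for i in range(n_blocks - k, n_blocks):
                blocks[i] = "hybrid"
        layout.append(blocks)
    return layout


layout = stage_layout(EncoderConfig())
for idx, blocks in enumerate(layout, start=1):
    print(f"stage {idx}: {len(blocks)} blocks, {blocks.count('hybrid')} hybrid")
```

Changing `blocks_per_stage` or `num_hybrid_blocks` gives a rough way to enumerate the kinds of variants that the block-count and attention-positioning ablations compare, without implying these are the configurations actually evaluated.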
