Exploring Camera Encoder Designs for Autonomous Driving Perception
Barath Lakshmanan, Joshua Chen, Shiyi Lan, Maying Shen, Zhiding Yu, Jose M. Alvarez
TL;DR
This work addresses the need for AV-specific camera encoders by starting from ConvNeXt and progressively tailoring micro- and macro-architectures to AV data characteristics, including long-range detection and multi-camera BEV fusion. The approach combines hardware-conscious block design (DriveNeXt), selective attention, and carefully tuned macro-architecture decisions (block width, stage count, block count, and stage compute ratio), plus high-resolution input handling. The optimized encoder achieves a relative mAP improvement of $8.79\%$ over the vanilla baseline, with an additional $1.2\%$ gain from a hybrid architecture, and demonstrates scalable variants (Tiny/Small/Base/Large) suitable for online and offline deployment, reaching $79.2\%$ mAP from $72.8\%$ on the AV dataset. These results underscore the value of domain-specific encoder customization for AV perception and offer a practical blueprint for building high-performance, deployable camera encoders in industry settings.
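As a quick sanity check on how the headline numbers relate, the short sketch below recomputes the gain from the two mAP figures quoted above. It assumes the reported improvement is relative to the baseline mAP (that is, the absolute gain divided by the baseline); the variable names are illustrative and not taken from the paper.

```python
# mAP figures quoted in the TL;DR (in percent).
baseline_map = 72.8
optimized_map = 79.2

# Absolute gain in mAP points, and gain relative to the baseline.
absolute_gain = optimized_map - baseline_map
relative_gain = absolute_gain / baseline_map * 100

print(f"absolute gain: {absolute_gain:.1f} mAP points")
print(f"relative gain: {relative_gain:.2f}%")
```

Under this assumption, the move from 72.8 to 79.2 mAP is a 6.4-point absolute gain, which matches the 8.79% relative figure.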
Abstract
The cornerstone of autonomous vehicles (AVs) is a solid perception system, in which camera encoders play a crucial role. Existing works usually leverage pre-trained Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) designed for general vision tasks, such as image classification, segmentation, and 2D detection. Although these well-known architectures have achieved state-of-the-art accuracy in AV-related tasks, e.g., 3D object detection, there remains significant potential for improvement in network design due to the nuanced complexities of industrial-level AV datasets. Moreover, existing public AV benchmarks usually contain insufficient data, which might lead to inaccurate evaluation of those architectures. To reveal AV-specific model insights, we start from a standard general-purpose encoder, ConvNeXt, and progressively transform the design. We adjust different design parameters, including the width and depth of the model, stage compute ratio, attention mechanisms, and input resolution, supported by a systematic analysis of each modification. This customization yields an architecture optimized for AV camera encoding, achieving an 8.79% mAP improvement over the baseline. We believe our effort could become a sweet cookbook of image encoders for AV and pave the way to next-level driving systems.
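To make the design axes named above concrete, here is a minimal sketch of how such a search space might be written down. The `EncoderConfig` class and its field names are hypothetical, not the authors' code. The defaults are the publicly documented ConvNeXt-Tiny stage depths and widths; the values in `variant` are arbitrary placeholders chosen only to show how each knob can be changed, and are not results or settings from the paper.

```python
from dataclasses import dataclass, replace
from functools import reduce
from math import gcd


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the design parameters listed in the abstract."""
    depths: tuple = (3, 3, 9, 3)          # blocks per stage (ConvNeXt-Tiny)
    widths: tuple = (96, 192, 384, 768)   # channels per stage (ConvNeXt-Tiny)
    attention_stages: tuple = ()          # stage indices using attention (assumed knob)
    input_hw: tuple = (224, 224)          # input resolution as (height, width)

    def stage_ratio(self) -> tuple:
        """Block counts per stage reduced to their smallest integer ratio."""
        g = reduce(gcd, self.depths)
        return tuple(d // g for d in self.depths)


baseline = EncoderConfig()

# Placeholder variant: every value here is illustrative, not from the paper.
variant = replace(
    baseline,
    depths=(2, 2, 8, 2),
    attention_stages=(3,),
    input_hw=(544, 960),
)

print(baseline.stage_ratio())  # (1, 1, 3, 1)
print(variant.stage_ratio())   # (1, 1, 4, 1)
```

Keeping the configuration immutable and deriving each variant with `replace` mirrors the progressive, one-change-at-a-time style of study the abstract describes, since each step differs from its predecessor only in the fields explicitly overridden.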
