Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning

Boying Li, Chang Liu, Petter Kyösti, Mattias Öhman, Devashish Singha Roy, Sofia Plazzi, Hamam Mokayed, Olle Hagner

TL;DR

This work tackles vehicle detection in UAV imagery under Nordic winter conditions where snow-induced domain shifts degrade performance. It introduces Sideload-Contrastive-Learning-Adaption (SCLA), a two-stage framework that pretrains a side CNN on unannotated data via Feature Map Patch-level Contrastive Learning (FM-PaCL) and then fuses its features with a frozen COCO-pretrained YOLO11n backbone using SE gating before detection heads. Empirical results on the Nordic Vehicle Dataset (NVD) show substantial improvements in $mAP_{50}$, with an $8.9\%$ gain under the NVD protocol and robust performance across alternative splits; ablations reveal that combining COCO pretraining with PaCL on unannotated data yields the strongest gains, while blockwise fusion can hinder performance. The approach enables improved, annotation-efficient vehicle detection suitable for edge devices, addressing domain gaps due to snow coverage and weather variability in Nordic UAV applications.

Abstract

Aside from common remote-sensing challenges such as small, sparse targets and limited computational budgets, detecting vehicles in UAV images from the Nordic regions faces strong visibility challenges and domain shifts caused by varying levels of snow coverage. Although annotated data are expensive, unannotated data are cheaper to obtain by simply flying the drones. In this work, we propose a sideload-CL-adaptation framework that enables the use of unannotated data to improve vehicle detection with lightweight models. Specifically, we train a CNN-based representation extractor through contrastive learning on the unannotated data in the pretraining stage, and then sideload it onto a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conduct extensive experiments comparing various fusion methods and granularities. Our proposed sideload-CL-adaptation model improves detection performance by 3.8% to 9.5% in terms of mAP50 on the Nordic Vehicle Dataset (NVD).
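
The pretraining idea in the abstract (contrastive learning on unannotated images) can be illustrated with a small NumPy sketch of a patch-level InfoNCE loss. This is not the authors' FM-PaCL implementation: the function name `patch_info_nce`, the non-overlapping average-pooled patches, the patch size, and the temperature are all assumptions chosen for illustration. The only property taken from the text is that the two views stay spatially aligned, so patch i in one view is the positive for patch i in the other.

```python
import numpy as np

def patch_info_nce(feat_a, feat_b, patch=2, temperature=0.1):
    """InfoNCE loss over spatially aligned feature-map patches.

    feat_a, feat_b: (C, H, W) feature maps from two augmented views of the
    same image. `patch` and `temperature` are hypothetical defaults.
    """
    def pool_patches(f):
        c, h, w = f.shape
        # Crop so that H and W are multiples of the patch size.
        f = f[:, : h - h % patch, : w - w % patch]
        # Average-pool each patch into a single C-dimensional vector.
        f = f.reshape(c, h // patch, patch, w // patch, patch).mean(axis=(2, 4))
        v = f.reshape(c, -1).T  # (num_patches, C)
        # L2-normalise so that dot products are cosine similarities.
        return v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)

    za, zb = pool_patches(feat_a), pool_patches(feat_b)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for patch i in view A is patch i in view B (the diagonal);
    # every other patch in view B acts as a negative.
    return float(-np.mean(np.diag(log_prob)))

# Usage: matching views should score a lower loss than unrelated ones.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 4, 4))
print(patch_info_nce(view_a, view_a))
print(patch_info_nce(view_a, rng.normal(size=(8, 4, 4))))
```

Because the positives are defined purely by spatial position, any augmentation that moves pixels (crops, flips, rotations) would break the pairing in this sketch, which is consistent with the paper's choice of photometric-only augmentations during pretraining.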

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The proposed Sideload-Contrastive-Learning-Adaption (SCLA) framework consists of two stages. (1) In the pretraining stage, unlabeled data are used to train a CNN-based feature extractor. The intermediate feature maps from this encoder are passed to FM-PaCL (Feature Map Patch-level Contrastive Learning), which encourages the extractor to learn fine-grained, spatially consistent representations by contrasting overlapping patches across augmented views. (2) In the fine-tuning stage, the annotated data are fed into both the feature extractor pretrained on unannotated data and the YOLO11n model. Features from the pretrained feature extractor and from YOLO11n's backbone are combined by a feature fusion block, and the fused features are passed on to YOLO11n's neck and head to produce the final detections. During the fine-tuning stage, the pretrained feature extractor and the COCO-pretrained YOLO11n backbone are kept frozen.
  • Figure 2: Illustration of the pretraining stage. In this stage, only photometric augmentations are applied, to preserve the spatial alignment of the feature-map patches. The feature extractor is a CNN with the same architecture as YOLO11n's backbone, ensuring that the resulting feature-map dimensions match those of YOLO11n and enabling seamless integration during feature fusion. C denotes the number of channels in the feature map.
  • Figure 3: Comparison of different gating mechanisms for feature fusion, including learnable weights, SE blocks, and Zero-Conv layers. * denotes channel-wise multiplication between the scalar and the feature map of the side CNN; + denotes element-wise addition.
  • Figure 4: Test results of different fusion techniques. The bars in each group appear in the order 'Frozen backbone', 'Addition', 'Learnable weights', 'Zero-Conv', 'SE'. Best viewed in color and zoomed in.
  • Figure 5: Blockwise feature fusion
  • ...and 1 more figure
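
The fusion variants named in the captions for Figures 3 and 4 (addition, learnable weights, SE, Zero-Conv) can be sketched in NumPy as follows. This is an illustrative reading of the captions, not the authors' code: the function names, the choice of one learnable weight per channel, computing the SE gate from the side features alone, and the residual form `backbone + gate * side` are all assumptions. The weights that would be learned in practice are passed in as plain arrays.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_add(backbone, side):
    # Plain element-wise addition of two (C, H, W) feature maps.
    return backbone + side

def fuse_learnable_weight(backbone, side, alpha):
    # alpha: one weight per channel, shape (C,) (assumed granularity).
    return backbone + alpha[:, None, None] * side

def fuse_se(backbone, side, w1, w2):
    # Squeeze-and-excitation gate computed from the side features (assumption).
    squeezed = side.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # bottleneck + ReLU -> (C // r,)
    gate = sigmoid(w2 @ hidden)              # per-channel gate in (0, 1)
    return backbone + gate[:, None, None] * side

def fuse_zero_conv(backbone, side, kernel):
    # 1x1 convolution with a (C_out, C_in) kernel; when the kernel starts at
    # zero, the fused output initially equals the backbone features.
    return backbone + np.einsum("oc,chw->ohw", kernel, side)

# Usage with random stand-ins for the two frozen feature maps.
rng = np.random.default_rng(0)
backbone = rng.normal(size=(4, 3, 3))
side = rng.normal(size=(4, 3, 3))
fused = fuse_se(backbone, side, rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))
print(fused.shape)
```

In this sketch the variants differ mainly in how much of the side signal is let through at the start of fine-tuning: a zero-initialised learnable weight or Zero-Conv kernel leaves the backbone features untouched at first, whereas the SE gate is input-dependent and rescales each side channel per image.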