Infrared Object Detection with Ultra Small ConvNets: Is ImageNet Pretraining Still Useful?

Srikanth Muralidharan, Heitor R. Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli

TL;DR

This work investigates whether ImageNet pretraining remains beneficial for ultra-small ConvNets aimed at infrared object detection on embedded devices. By downscaling EfficientNet-B0 and MobileNetV3 to ultra-small variants (B-1..B-7 and S-0..S-6) and evaluating them with IN, IN→COCO, and scratch (random) initializations, the authors quantify how model capacity affects cross-domain and cross-modality generalization. They show that the benefits of pretraining persist at moderate capacities but diminish as model size shrinks. IN→COCO often outperforms the other initializations on detection tasks, especially for easier shifts and larger backbones; for the smallest models, the gains can disappear or reverse. The results yield practical guidance: use pretraining when possible, avoid the ultra-small regime when deploying under domain shift, and consider task-aligned pretraining (IN→COCO) for better out-of-domain robustness. The study also provides a systematic downscaling recipe and a comprehensive benchmark across detection and classification to inform embedded-system design and deployment choices.

Abstract

Many real-world applications require recognition models that are robust to different operational conditions and modalities, yet can run on small embedded devices with limited hardware. For normal-size models, pretraining is known to be very beneficial for accuracy and robustness; for small models suited to embedded and edge devices, its effect is not clear. In this work, we investigate the effect of ImageNet pretraining on increasingly small backbone architectures (ultra-small models, with fewer than 1M parameters) with respect to robustness in downstream object detection tasks in the infrared visual modality. Using scaling laws derived from standard object recognition architectures, we construct two ultra-small backbone families and systematically study their performance. Our experiments on three different datasets reveal that while ImageNet pretraining is still useful, beyond a certain capacity threshold it offers diminishing returns in terms of out-of-distribution detection robustness. Therefore, we advise practitioners to still use pretraining and, when possible, to avoid overly small models: while they might work well for in-domain problems, they are brittle when working conditions differ.
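
The abstract mentions scaling laws derived from standard architectures but does not spell out the rule. As a rough illustration only, the sketch below applies EfficientNet-style compound scaling with a negative exponent to shrink a baseline. The coefficients (1.2, 1.1, 1.15) are the ones published for EfficientNet; using them with negative `phi`, and scaling all three dimensions, is an assumption here and may differ from the authors' actual recipe. The function name `compound_multipliers` is hypothetical.

```python
# EfficientNet compound-scaling bases for depth, width, and resolution.
# Applying them with a negative exponent is an ASSUMPTION for illustration,
# not a confirmed description of the paper's downscaling recipe.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_multipliers(phi: float) -> tuple[float, float, float]:
    """Return (depth, width, resolution) multipliers for exponent phi.

    phi = 0 reproduces the baseline; phi < 0 shrinks every dimension.
    """
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


if __name__ == "__main__":
    # Walk from the baseline down to three downscaling steps.
    for phi in range(0, -4, -1):
        d, w, r = compound_multipliers(phi)
        print(f"phi={phi:+d}: depth x{d:.3f}, width x{w:.3f}, resolution x{r:.3f}")
```

Each step down in `phi` multiplies depth, width, and resolution by a fixed factor below one, so the parameter count falls geometrically. That is one way a family of progressively smaller variants could be generated from a single baseline.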

Paper Structure

This paper contains 32 sections, 2 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Out‑of‑distribution mAP gain from ImageNet pre‑training on FLIR for ultra‑small models. The x‑axis reports model size (number of parameters, log scale); the y‑axis reports the mAP gain over random initialization when the detector is fine‑tuned on RGB and evaluated on IR. The red dashed line marks zero benefit: points above it indicate a positive gain over random initialization. For backbones with $\gtrsim$100k parameters, the pre‑training advantage tends to grow, suggesting a monotonic link between capacity and cross‑domain generalization. In contrast, smaller networks hover unpredictably around zero, revealing no consistent trend at ultra‑low parameter counts.
  • Figure 2: Phases of our approach. We obtain initialization weights for our model families in two ways: supervised pretraining (e.g., ImageNet classification / COCO detection) and random initialization. We then train these models for classification or detection tasks with each initialization in parallel on an in-domain detection dataset. Finally, we evaluate the effectiveness of supervised pretraining by testing both models on cross-modal and cross-domain object detection tasks.
  • Figure 3: Cross-dataset generalization from FLIR to LLVIP RGB images. Results are reported for two model families, EfficientNet (left) and MobileNetV3 (right), across different model variants ($b{=}0$–$7$ and $s{=}0$–$6$). Curves compare three initialization strategies: IN$\rightarrow$COCO, IN, and Random. Performance trends show that the IN$\rightarrow$COCO pretraining paradigm consistently yields stronger generalization, while Random initialization performs worst, especially for smaller model variants.
  • Figure 4: Cross-dataset generalization from LLVIP to FLIR RGB images. Performance is shown for two model families, EfficientNet (left) and MobileNetV3 (right), across different model variants ($b{=}0$–$7$ and $s{=}0$–$6$). Curves compare three initialization strategies: IN$\rightarrow$COCO, IN, and Random. Results indicate that larger model variants tend to generalize better, with the IN$\rightarrow$COCO pretraining paradigm providing the most consistent improvements.
  • Figure 5: Modality adaptation of ultra-small EfficientNet and MobileNet models from the RGB to the infrared domain on the LLVIP and FLIR datasets. We observe that ImageNet pretraining is helpful only for the first few models in both families and on both datasets.
  • ...and 6 more figures
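
The zero-benefit line in Figure 1 can be read as a per-model difference between the mAP obtained with a pretrained initialization and the mAP obtained with random initialization. The sketch below shows that computation under stated assumptions: the function name `pretraining_gain`, the model names, and all numbers are hypothetical placeholders, not values from the paper.

```python
def pretraining_gain(map_pretrained, map_random):
    """Per-model mAP difference: pretrained minus random initialization."""
    return {name: map_pretrained[name] - map_random[name] for name in map_pretrained}


# Placeholder values for illustration only; NOT results from the paper.
map_in = {"model_a": 30.0, "model_b": 10.0}
map_rand = {"model_a": 25.0, "model_b": 10.5}

gains = pretraining_gain(map_in, map_rand)
for name, gain in gains.items():
    # A positive gain corresponds to a point above the dashed zero line.
    side = "above" if gain > 0 else "at or below"
    print(f"{name}: gain {gain:+.1f} mAP ({side} the zero line)")
```

A model whose gain is negative, like the hypothetical `model_b`, would fall below the dashed line: pretraining hurt rather than helped in that configuration.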