Contrast-Guided Cross-Modal Distillation for Thermal Object Detection

SiWoo Kim, JhongHyun An

TL;DR

Problem: thermal detection at night struggles with low contrast and weak texture cues, causing missed and duplicate detections. Approach: CGDet trains with two objectives—RoI-level supervised contrastive learning ($L_{rcs}$) and cross-modal guidance ($L_{cms}$)—to sharpen instance boundaries and inject RGB semantics, with total loss $L_{total} = L_{det} + \lambda_{rcs} L_{rcs} + \lambda_{cms} L_{cms}$. Contributions: RoI-level contrastive learning improves global class separability; cross-modal feature distillation injects RGB priors into multi-level FPN features; experiments on FLIR show state-of-the-art mono-modality results and competitive performance versus multispectral methods, with a lightweight model (~3.15M params) and no test-time overhead. Significance: enables robust night-time detection with lower hardware cost and without extra sensors, suitable for autonomous systems.
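As a minimal sketch of how the total loss above combines its three terms, the following Python function computes the weighted sum. The function name, the default weights, and the example scalar values are illustrative assumptions; the summary does not state the values of $\lambda_{rcs}$ or $\lambda_{cms}$.

```python
def total_loss(l_det, l_rcs, l_cms, lambda_rcs=0.5, lambda_cms=0.5):
    """Weighted sum L_total = L_det + lambda_rcs * L_rcs + lambda_cms * L_cms.

    The default weights are placeholders for illustration, not values
    reported by the paper.
    """
    return l_det + lambda_rcs * l_rcs + lambda_cms * l_cms


# Made-up scalar losses, as they might look for one training step.
loss = total_loss(l_det=1.2, l_rcs=0.8, l_cms=0.4)

# Setting both weights to zero recovers the plain detection loss,
# which mirrors the fact that both extra terms are training-only add-ons.
baseline = total_loss(l_det=1.2, l_rcs=0.8, l_cms=0.4,
                      lambda_rcs=0.0, lambda_cms=0.0)
```

Because the two extra terms only affect the training objective, the detector's test-time computation is unchanged, consistent with the "no test-time overhead" claim.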

Abstract

Robust perception at night remains challenging for thermal-infrared (TIR) detection: low contrast and weak high-frequency cues lead to duplicate, overlapping boxes, missed small objects, and class confusion. Prior remedies either translate TIR to RGB and hope pixel fidelity transfers to detection -- making performance fragile to color or structure artifacts -- or fuse RGB and TIR at test time, which requires extra sensors, precise calibration, and higher runtime cost. Both lines of work can help in favorable conditions, but neither directly shapes the thermal representation used by the detector. We keep mono-modality inference and tackle the root causes during training. Specifically, we introduce two training-only objectives. The first sharpens instance-level decision boundaries by pulling together features of the same class and pushing apart those of different classes, which suppresses duplicate and confusing detections. The second injects cross-modal semantic priors by aligning the student's multi-level pyramid features with an RGB-trained teacher, thereby strengthening texture-poor thermal features without visible input at test time. In experiments on FLIR, our method outperforms prior mono-modality approaches, achieving state-of-the-art performance, and is competitive with multispectral methods.
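The "pull together, push apart" objective can be illustrated with the standard supervised contrastive loss applied to a small set of RoI embeddings. This is a sketch under stated assumptions, not the paper's implementation: the exact form of $L_{rcs}$, the temperature, and the function names `l2_normalize` and `supcon_loss` are all assumed here, and the embeddings are toy 2-D vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a set of RoI embeddings.

    For each anchor, same-class RoIs are positives and all other RoIs
    form the contrast set. The temperature is an assumed value.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner is skipped
        # Temperature-scaled cosine similarity to every other RoI.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Log-sum-exp over the contrast set, shifted by the max for stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


labels = ["person", "person", "car", "car"]
# Same-class embeddings lie close together: the loss is small.
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same-class embeddings lie far apart: the loss is large.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = supcon_loss(separated, labels)
loss_mixed = supcon_loss(mixed, labels)
```

Minimizing this quantity moves the `mixed` configuration toward the `separated` one, which is the class-separable feature space the abstract describes.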

Paper Structure

This paper contains 14 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: (a) Conventional thermal-to-visible (T2V). A pix2pix-style translator (Isola et al., 2017) with a U-Net generator and a PatchGAN discriminator. (b) TIRDet (Wang et al., 2023). A frozen T2V translator produces a pseudo-RGB image that is early-fused with the thermal input. (c) Ours. The RCS module improves class separability in the feature space.
  • Figure 2: Overview of the framework. (a) The framework consists of the RoI Contrastive (RCS) module and the Cross-Modal Guidance (CMG) module. (b) RCS applies supervised contrastive learning to GT-aligned RoI embeddings to enforce a class-separable space. (c) CMG aligns teacher RGB features and student thermal RoI features via feature-level distillation; inference uses thermal only.
  • Figure 3: Qualitative comparison on the FLIR dataset. Each triplet shows TIRDet (left), Ours (middle), and GT (right). Red, green, and blue boxes denote person, bike, and car, respectively.
  • Figure 4: Failure cases on the FLIR dataset. Top: Ours. Bottom: GT. Left column: motion blur and weak edges lead to a missed cyclist; middle and right columns: distant, small pedestrians are not detected.
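The feature-level distillation described in the Figure 2(c) caption can be sketched as a per-level distance between student (thermal) and teacher (RGB) features. The choice of mean squared error, the level names `p3` and `p4`, the flat-list feature layout, and the function name `feature_distillation_loss` are assumptions made for illustration; the captions do not specify the distance used by CMG.

```python
def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher features, averaged over levels.

    Both arguments map a pyramid level name to a flat list of floats.
    MSE is an assumed distance; the teacher is treated as a fixed target.
    """
    per_level = []
    for level, s in student_feats.items():
        t = teacher_feats[level]
        if len(s) != len(t):
            raise ValueError(f"feature size mismatch at level {level}")
        per_level.append(sum((a - b) ** 2 for a, b in zip(s, t)) / len(s))
    return sum(per_level) / len(per_level) if per_level else 0.0


# Toy features: the student matches the teacher at p3 but not at p4.
student = {"p3": [0.2, 0.4], "p4": [1.0, 0.0]}
teacher = {"p3": [0.2, 0.4], "p4": [0.0, 0.0]}

cms = feature_distillation_loss(student, teacher)
```

Only the student receives gradients from such a term in a typical distillation setup, and the teacher branch is dropped at inference, which matches the caption's note that inference uses thermal input only.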