MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection
Heitor R. Medeiros, David Latortue, Eric Granger, Marco Pedersoli
TL;DR
This work addresses RGB/IR object detection with a single shared transformer encoder trained on both modalities through patch-level mixing (MiPa). A patch-wise modality-agnostic (MA) module, inspired by gradient reversal, enforces modality invariance to balance the contribution of each modality during training while adding no inference overhead. The method is validated on LLVIP and FLIR, showing competitive results against multimodal fusion approaches and enabling robust performance when only one modality is available at test time. By combining a flexible patch sampling ratio $\rho$ with a learnable MA objective, MiPa delivers modality-agnostic representations suitable for multiple transformer-based detectors, enabling practical, real-time cross-modality adoption in surveillance and automotive settings.
Abstract
In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we explore a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting requires a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder, while countering the effects of modality imbalance. For this, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality-agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa can learn a representation that reaches competitive results on traditional RGB/IR benchmarks while only requiring a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
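To make the idea of mixing patches from the two modalities concrete, the following is a minimal sketch, not the authors' implementation. It assumes spatially aligned RGB and IR images of identical shape (IR replicated to the same number of channels), a square patch grid that divides the image evenly, and an independent per-patch draw in which a patch is taken from RGB with probability `rho`. The function name `mix_patches` and this interpretation of the sampling ratio are illustrative assumptions; the modality-agnostic module is not shown.

```python
import numpy as np


def mix_patches(rgb, ir, patch_size, rho, rng):
    """Compose one image whose patches come from RGB or IR.

    Assumption (not from the paper): each patch is drawn independently,
    from RGB with probability `rho` and from IR otherwise.
    """
    if rgb.shape != ir.shape:
        raise ValueError("rgb and ir must have the same shape")
    height, width = rgb.shape[:2]
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    grid = (height // patch_size, width // patch_size)
    # Patch-level mask: True selects the RGB patch, False the IR patch.
    take_rgb = rng.random(grid) < rho
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(take_rgb, patch_size, axis=0), patch_size, axis=1)
    # Broadcast the mask over the channel axis and select per pixel.
    mixed = np.where(pixel_mask[..., None], rgb, ir)
    return mixed, take_rgb


# Toy inputs: an all-ones "RGB" image and an all-zeros "IR" image.
rng = np.random.default_rng(0)
rgb = np.ones((8, 8, 3), dtype=np.float32)
ir = np.zeros((8, 8, 3), dtype=np.float32)
mixed, take_rgb = mix_patches(rgb, ir, patch_size=4, rho=0.5, rng=rng)
```

With these toy inputs, every 4x4 patch of `mixed` is either all ones or all zeros, matching the corresponding entry of `take_rgb`. Setting `rho` to 1.0 or 0.0 recovers the pure RGB or pure IR image, which corresponds to the single-modality input seen at inference.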
