MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection

Heitor R. Medeiros, David Latortue, Eric Granger, Marco Pedersoli

TL;DR

This work addresses RGB/IR object detection with a single shared transformer encoder by training on both modalities through patch-level mixing (MiPa). A patch-wise modality agnostic module, inspired by gradient reversal, enforces modality invariance to balance contributions during training while avoiding inference overhead. The method is validated on LLVIP and FLIR, showing competitive results against multimodal fusion approaches and enabling robust performance when only one modality is available at test time. By combining a flexible patch sampling ratio $\rho$ with a learnable MA objective, MiPa delivers modality-agnostic representations suitable for multiple transformer-based detectors, enabling practical, real-time cross-modality adoption in surveillance and automotive settings.
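For background, the standard gradient reversal layer that the modality agnostic module is said to be inspired by can be written as follows. This is the generic formulation from the domain-adaptation literature, not necessarily the exact objective used in MiPa; the symbols $R_\lambda$, $\lambda$, $\theta_f$, $\mathcal{L}_{det}$ and $\mathcal{L}_{mod}$ are assumptions introduced here for illustration.

$$
R_\lambda(x) = x, \qquad \frac{\partial R_\lambda}{\partial x} = -\lambda I
$$

$$
\theta_f \leftarrow \theta_f - \eta \left( \frac{\partial \mathcal{L}_{det}}{\partial \theta_f} - \lambda \frac{\partial \mathcal{L}_{mod}}{\partial \theta_f} \right)
$$

Here $\theta_f$ denotes the shared encoder parameters, $\mathcal{L}_{det}$ a detection loss, $\mathcal{L}_{mod}$ the loss of a modality classifier, and $\eta$ a learning rate. The layer is the identity in the forward pass, so it adds nothing at inference; in the backward pass the sign flip pushes the encoder toward features from which the modality is hard to predict.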

Abstract

In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting requires a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder, while countering the effects of modality imbalance. For this, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa can learn a representation to reach competitive results on traditional RGB/IR benchmarks while only requiring a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
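The patch-mixing idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`patchify`, `unpatchify`, `mix_patches`) are hypothetical, the IR image is assumed to be already aligned with the RGB image and replicated to the same number of channels, and `rho` is interpreted here as the probability that a given patch is taken from RGB.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into (N, patch, patch, C) non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)


def unpatchify(patches: np.ndarray, h: int, w: int) -> np.ndarray:
    """Inverse of patchify: reassemble (N, p, p, C) patches into an (H, W, C) image."""
    _, p, _, c = patches.shape
    grid = patches.reshape(h // p, w // p, p, p, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def mix_patches(rgb, ir, patch, rho, rng):
    """Build one image whose patches come from RGB with probability rho, else from IR.

    Returns the mixed image and a boolean mask (True where the patch is RGB).
    Assumption: rgb and ir are spatially aligned and have identical shapes.
    """
    assert rgb.shape == ir.shape, "modalities must be aligned and equally shaped"
    h, w, _ = rgb.shape
    rgb_p, ir_p = patchify(rgb, patch), patchify(ir, patch)
    from_rgb = rng.random(rgb_p.shape[0]) < rho  # one independent draw per patch
    mixed = np.where(from_rgb[:, None, None, None], rgb_p, ir_p)
    return unpatchify(mixed, h, w), from_rgb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = np.ones((8, 8, 3))   # stand-in for a visible image
    ir = np.zeros((8, 8, 3))   # stand-in for an infrared image
    mixed, mask = mix_patches(rgb, ir, patch=4, rho=0.5, rng=rng)
    print(mask)                # which of the 4 patches came from RGB
    print(mixed[..., 0])       # blocks of ones (RGB) and zeros (IR)
```

Setting `rho` to 1.0 or 0.0 recovers a pure RGB or pure IR input, which matches the single-modality setting used at inference. The per-patch mask is returned because a patch-wise modality classifier would need it as a training target.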

Paper Structure (18 sections, 8 equations, 6 figures, 8 tables)

This paper contains 18 sections, 8 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Differences in inputs for different modality learning. (a) Unimodal learning assumes that only one modality is used for both training and testing. (b) Multimodal learning requires multiple modalities and a special architecture to fuse them in order to improve performance. (c) Ours assumes that a model should be able to perform well for both modalities by using both for training but only one at a time for testing and with a shared vision encoder.
  • Figure 2: Mixed Patches (MiPa) with Modality Agnostic (MA) module. In yellow is the patchify function. In purple is the MiPa module, followed by the feature extractor (encoder). In green is the modality classifier, and in pink is the detection head.
  • Figure 3: Detections from different methods at two times of day (night and day) and in two modalities (RGB and IR). Detectors trained on RGB work better in the daytime. Detectors trained on IR work better at nighttime. Detectors trained on Both modalities in a naive way work well only on the dominant modality. Our MiPa manages to work well in all conditions.
  • Figure 4: Our Both baseline for multimodal object detection learning with patches. The yellow block is the patchify function. In green, we have the block representing one or the other patch modality to use. In blue is the backbone, and in pink is the head of the detector.
  • Figure 5: Mix Patches diagram: First, in yellow, is the patchify function, which is responsible for providing the patches. Second, in purple, is the mix patches function, which is responsible for mixing the patches based on a pre-defined policy, e.g., uniform distribution of both modalities. Then, in blue is the backbone, and in pink is the detection head.
  • ...and 1 more figures