FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network

Weiying Xie, Yusi Zhang, Tianlin Hui, Jiaqing Zhang, Jie Lei, Yunsong Li

TL;DR

This work tackles distribution biases in multimodal object detection arising from two-stream backbones by introducing FoRA with a shared backbone and Low-rank Modal Adaptors (LMA). A dynamic adaptive rank allocation strategy tunes adaptor capacity across feature levels, balancing heterogeneity with parameter cost. Empirical results on DroneVehicle and LLVIP show state-of-the-art accuracy with dramatically reduced parameter growth, notably achieving a 10.4% mAP@0.5 gain on DroneVehicle with ~149M fewer parameters. The approach provides a scalable, efficient path for robust multimodal fusion in challenging visual environments.

Abstract

Multimodal object detection is a promising way to achieve robust detection under varied visual conditions. However, existing two-stream backbone networks suffer from complex fusion and a substantial increase in parameters, primarily because of large data distribution biases in multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA), with a shared backbone. The shared parameters enhance the consistency of homogeneous information, while lightweight modal adaptors focus on modality-unique features. Furthermore, we design an adaptive rank allocation strategy to accommodate the varying heterogeneity at different feature levels. Experiments on two multimodal object detection datasets validate the effectiveness of our method. Notably, on DroneVehicle, LMA attains a 10.4% accuracy improvement over the state-of-the-art method with a 149M-parameter reduction. The code is available at https://github.com/zyszxhy/FoRA. Our work was submitted to ACM MM in April 2024, but was rejected. We will continue to refine the work and the paper, mainly by adding theoretical proofs and multi-task applications of FoRA.

Paper Structure

This paper contains 18 sections, 26 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of changes in the statistical data distributions of valid information and the corresponding multimodal object detection structures. (a) Valid information (homogeneous and heterogeneous information) exhibits different distribution biases in multimodal images. (b) Previous methods employ a two-stream backbone (bottom) to extract features, which introduces larger distribution biases. (c) We greatly reduce the biases and the model scale by combining a shared backbone with modal adaptors (bottom).
  • Figure 2: Overview of LMA (taking a convolutional backbone as an example). When a convolutional layer runs, a new modality convolutional kernel is first generated by merging the modal adaptor weights with the shared convolutional kernel weights. The input data are then convolved with the new modality kernel to produce the extracted feature map. The details of this process are shown in Figure 3 and Figure 4. “P3”, “P4” and “P5” denote the features that are fused and then fed into the detector.
  • Figure 3: Details of the adaptor structure and the parameter merging process (using a convolutional layer as an example). Each adaptor comprises three low-rank matrices $P$, $\Lambda$, $Q$, which generate the adaptor kernel by matrix multiplication and reshaping. The weights of the modality kernel are the sum of the shared layer weights and the adaptor kernel weights.
  • Figure 4: The equivalent data flow for the convolution of the input data $X$ with $\mathcal{K}_{modality}$. $X$ is convolved with $\mathcal{K}_{shared}$ (top branch) and with $\mathcal{K}_{adaptor}$ (bottom branch), and the two results are summed to give the output $Y$.
  • Figure 5: Data distribution biases, measured by the absolute Pearson product-moment correlation coefficient ($|\rho|$) between the feature maps of the two modalities. The red line represents the baseline feature maps. The green and blue lines represent feature maps extracted by adaptor kernels and shared kernels, respectively.
  • ...and 3 more figures
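
The captions of Figures 3 and 4 describe a kernel built from low-rank factors and merged with a shared kernel. The sketch below illustrates that idea in plain Python; it is not the authors' implementation. The factor shapes ($P$ as c_out x r, $\Lambda$ as a diagonal of length r, $Q$ as r x (c_in·k·k)), the flattening order used for the reshape, and all function names are assumptions made for illustration. It builds an adaptor kernel, adds it to a shared kernel, and checks numerically that convolving with the merged kernel equals summing the two branches, which follows from the linearity of convolution.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def adaptor_kernel(P, lam, Q, c_in, k):
    """Build a (c_out, c_in, k, k) kernel from assumed factor shapes:
    P is c_out x r, lam holds the r diagonal entries, Q is r x (c_in*k*k)."""
    scaled_Q = [[lam[i] * v for v in Q[i]] for i in range(len(lam))]  # diag(lam) @ Q
    flat = matmul(P, scaled_Q)  # c_out x (c_in*k*k)
    # Reshape each flat row into c_in planes of k x k (assumed row-major layout).
    return [[[row[(ci * k + i) * k:(ci * k + i) * k + k] for i in range(k)]
             for ci in range(c_in)] for row in flat]

def conv2d(x, kernel):
    """Valid cross-correlation, stride 1. x: (c_in, H, W); kernel: (c_out, c_in, k, k)."""
    c_in, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernel[0][0])
    return [[[sum(filt[c][di][dj] * x[c][i + di][j + dj]
                  for c in range(c_in) for di in range(k) for dj in range(k))
              for j in range(W - k + 1)]
             for i in range(H - k + 1)]
            for filt in kernel]

def add(a, b):
    """Elementwise sum of two equally shaped nested lists."""
    if isinstance(a, list):
        return [add(x, y) for x, y in zip(a, b)]
    return a + b

def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two nested lists."""
    if isinstance(a, list):
        return max(max_abs_diff(x, y) for x, y in zip(a, b))
    return abs(a - b)

def rand_tensor(rng, *shape):
    """Nested list of the given shape filled with uniform values in [-1, 1]."""
    if len(shape) == 1:
        return [rng.uniform(-1, 1) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]

def demo(seed=0):
    """Return the gap between the merged-kernel output and the two-branch output."""
    rng = random.Random(seed)
    c_out, c_in, k, r, H, W = 4, 3, 3, 2, 6, 6  # illustrative sizes only
    K_shared = rand_tensor(rng, c_out, c_in, k, k)
    P = rand_tensor(rng, c_out, r)
    lam = rand_tensor(rng, r)
    Q = rand_tensor(rng, r, c_in * k * k)
    X = rand_tensor(rng, c_in, H, W)

    K_adaptor = adaptor_kernel(P, lam, Q, c_in, k)
    K_modality = add(K_shared, K_adaptor)          # merge as in Figure 3
    y_merged = conv2d(X, K_modality)               # single convolution
    y_two_branch = add(conv2d(X, K_shared),        # top branch of Figure 4
                       conv2d(X, K_adaptor))       # bottom branch of Figure 4
    return max_abs_diff(y_merged, y_two_branch)

if __name__ == "__main__":
    print("max |merged - two-branch| =", demo())
```

Under these assumed shapes, one adaptor stores r·(c_out + c_in·k·k + 1) values, against c_out·c_in·k·k for a full second kernel, which is where the saving over a second backbone stream would come from when r is small.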