
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen

TL;DR

GeminiFusion tackles the efficiency bottleneck of cross-modal fusion in vision transformers by introducing a pixel-wise fusion module that enforces spatially aligned token interactions, achieving linear complexity relative to the number of input tokens ($O(N c^2)$) versus the quadratic cost of full cross-attention ($O(N^2 c)$). It combines intra- and inter-modal attention with a relation discriminator and layer-adaptive noise to stabilize learning and balance information exchange. Empirically, GeminiFusion delivers state-of-the-art or competitive results in multimodal semantic segmentation, image-to-image translation, and 3D object detection across diverse backbones (e.g., SegFormer, Swin) and datasets (NYUDv2, SUN RGB-D, DeLiVER, Taskonomy, KITTI), while significantly reducing computational load. The approach is plug-and-play for common vision backbones and leverages unimodal pretraining, presenting a practical path toward efficient, scalable multimodal perception.
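To make the complexity comparison concrete, here is a rough sketch that counts multiply-accumulate operations (MACs) for the two approaches. Only the asymptotic forms $O(N c^2)$ and $O(N^2 c)$ come from the text above. The function names, the choice of which terms to count, and the constants are illustrative assumptions, not figures from the paper.

```python
def cross_attention_macs(n_tokens: int, channels: int) -> int:
    """Rough MAC count for full cross-attention between two N-token sequences.

    Assumption: only the N x N score matrix (Q @ K^T) and the weighted sum
    (scores @ V) are counted; both scale as N^2 * c.
    """
    scores = n_tokens * n_tokens * channels
    weighted_sum = n_tokens * n_tokens * channels
    return scores + weighted_sum


def pixelwise_fusion_macs(n_tokens: int, channels: int, keys_per_pixel: int = 2) -> int:
    """Rough MAC count for pixel-wise fusion.

    Assumption: each token attends only to a small fixed set of keys at the
    same spatial location, so the cost is dominated by per-token c x c
    linear projections (query, key, value), which scale as N * c^2.
    """
    projections = 3 * n_tokens * channels * channels
    attention = 2 * n_tokens * keys_per_pixel * channels
    return projections + attention


if __name__ == "__main__":
    c = 64  # hypothetical channel width
    for n in (1024, 4096, 16384):
        ratio = cross_attention_macs(n, c) / pixelwise_fusion_macs(n, c)
        print(f"N={n:6d}  cross-attention / pixel-wise MAC ratio = {ratio:.1f}")
```

Under these assumptions, quadrupling the token count multiplies the cross-attention count by 16 but the pixel-wise count by only 4, which is why the gap widens for high-resolution inputs.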

Abstract

Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token-exchange methods, which replace less informative tokens with inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, although the computational demand of the latter inevitably restricts their use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion combines intra-modal and inter-modal attention, dynamically integrating complementary information across modalities. We employ layer-adaptive noise to control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring that this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation tasks, covering RGB, depth, LiDAR, and event data, demonstrate the superior performance of GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion
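The pixel-wise idea described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: it omits the learned projections and the relation discriminator, and it models the layer-adaptive noise as a single scalar `noise` supplied by the caller. All names here (`fuse_pixel`, `pixelwise_fuse`) are hypothetical. The point is only that each token attends to two keys, itself and the spatially aligned token from the other modality, so the work grows linearly with the number of tokens.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_pixel(own: Vector, other: Vector, noise: float) -> Vector:
    """Fuse one spatial location using attention over exactly two keys."""
    scale = 1.0 / math.sqrt(len(own))
    # Assumption: the noise is a scalar added to the intra-modal key only.
    self_key = [x + noise for x in own]
    s_self = dot(own, self_key) * scale   # intra-modal score
    s_cross = dot(own, other) * scale     # inter-modal score
    # Numerically stable softmax over the two scores.
    m = max(s_self, s_cross)
    e_self, e_cross = math.exp(s_self - m), math.exp(s_cross - m)
    w_self = e_self / (e_self + e_cross)
    w_cross = 1.0 - w_self
    return [w_self * a + w_cross * b for a, b in zip(own, other)]


def pixelwise_fuse(tokens_a: List[Vector], tokens_b: List[Vector],
                   noise: float = 0.0) -> List[Vector]:
    """Single pass over N aligned token pairs; no N x N score matrix is built."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("modalities must be spatially aligned (same token count)")
    return [fuse_pixel(a, b, noise) for a, b in zip(tokens_a, tokens_b)]
```

A call such as `pixelwise_fuse(rgb_tokens, depth_tokens, noise=0.1)` returns one fused vector per spatial location. In this simplified form, changing `noise` shifts the intra-modal score and therefore the balance between the two attention weights.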

Paper Structure (18 sections, 6 equations, 10 figures, 9 tables)


Figures (10)

  • Figure 1: Improvements of our GeminiFusion across five multimodal semantic segmentation tasks. GeminiFusion achieves +$2.6\%$, +$1.3\%$, +$2.8\%$, +$1.9\%$, and +$3.4\%$ performance gains. All training epoch numbers are aligned. D: Depth, E: Event, L: LiDAR.
  • Figure 2: (a) Overall architecture of GeminiFusion: the proposed GeminiFusion model is designed to be plug-and-play, allowing it to be seamlessly integrated into various vision backbones. (b) GeminiFusion module: performs pixel-wise fusion to enrich multimodal features by utilizing aligned features from two modalities. (c) TokenFusion: swaps certain pixels between two features, which results in information loss. (d) Cross-attention: requires a significant amount of memory, with complexity quadratic in the number of input tokens.
  • Figure 3: Impact of the threshold on the exchange-based TokenFusion. Exchanging all tokens almost invariably yields the best outcomes.
  • Figure 4: Image-to-image translation results on the validation split of Taskonomy. Best viewed in color and zoomed in.
  • Figure 5: Comparison of attention scores obtained from self-attention (intra-modality) and cross-attention (inter-modality). Left: with noise. Right: without noise.
  • ...and 5 more figures