MoRA: Missing Modality Low-Rank Adaptation for Visual Recognition

Shu Zhao, Nilesh Ahuja, Tan Yu, Tianyi Shen, Vijaykrishnan Narayanan

TL;DR

MoRA tackles the missing-modality problem in pretrained vision-language models with a parameter-efficient fine-tuning framework that enables bidirectional cross-modal transfer while preserving modality-specific adaptations. It pairs modality-specific adapters with shared cross-modal parameters; cross-modal interaction is implemented in a dimension-agnostic rank space via the Gram matrices $\mathbf{G}^{\mathrm{v}}$ and $\mathbf{G}^{\mathrm{t}}$, and the weight updates are absorbed into the frozen backbone to avoid inference overhead. Empirically, MoRA delivers an average improvement of about $5.24\%$ in missing-modality scenarios, trains only $0.11\%$ of the parameters required by full fine-tuning, and runs in $25.90\%$ of the SOTA method's inference time. It is validated on MM-IMDb, UPMC-Food101, and Hateful Memes, and extended to embedding tasks such as CIR. The results show robust cross-scenario generalization, ablation-backed importance of Gram-based sharing, and favorable scaling across backbones, which supports deploying multimodal systems under modality scarcity.
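
The two mechanical claims in this summary, that rank-space interaction is independent of each encoder's hidden size and that the update can be folded into the frozen weights, can be illustrated with a small NumPy sketch. This is not the authors' formulation: the Gram matrices are assumed here to be $\mathbf{A}\mathbf{A}^{\top}$ of each modality's down-projection, the mixing rule $\mathbf{B}(\mathbf{I} + \mathbf{G})\mathbf{A}$ is invented purely for illustration, and the names `gram`, `merged_weight`, `r`, `d_v`, and `d_t` are hypothetical.

```python
import numpy as np


def gram(a):
    """Return the r x r Gram matrix A A^T of a down-projection A of shape (r, d).

    ASSUMPTION: the source only says Gram matrices G^v and G^t are used;
    defining them as A A^T is a guess made for this sketch.
    """
    return a @ a.T


def merged_weight(w0, b, a, g_other):
    """Fold a low-rank update into a frozen weight matrix.

    ASSUMPTION: the update is B (I + G_other) A, where G_other is the
    r x r Gram matrix of the other modality. Illustrative form only.
    """
    r = a.shape[0]
    mix = np.eye(r) + g_other      # rank-space mixing, shape (r, r)
    return w0 + b @ mix @ a        # same shape as w0


rng = np.random.default_rng(0)
r, d_v, d_t = 4, 16, 12            # hypothetical rank and hidden sizes

# Frozen weights of a vision layer and a text layer with different widths.
w_v = rng.normal(size=(d_v, d_v))
w_t = rng.normal(size=(d_t, d_t))

# Modality-specific low-rank factors (down-projection A, up-projection B).
a_v, b_v = 0.1 * rng.normal(size=(r, d_v)), 0.1 * rng.normal(size=(d_v, r))
a_t, b_t = 0.1 * rng.normal(size=(r, d_t)), 0.1 * rng.normal(size=(d_t, r))

# Both Gram matrices are (r, r) even though d_v != d_t, so either one can
# be used inside the other encoder's update.
g_v, g_t = gram(a_v), gram(a_t)

w_v_merged = merged_weight(w_v, b_v, a_v, g_t)
w_t_merged = merged_weight(w_t, b_t, a_t, g_v)

# Running the adapter as a separate branch matches the merged weight,
# so inference needs only a single matmul per layer.
x = rng.normal(size=(3, d_v))
y_adapter = x @ w_v.T + (x @ a_v.T) @ (np.eye(r) + g_t).T @ b_v.T
y_merged = x @ w_v_merged.T
assert np.allclose(y_adapter, y_merged)
```

The point of the sketch is only that an $r \times r$ quantity can pass between encoders of different widths, and that any update of the form $\mathbf{B}\mathbf{M}\mathbf{A}$ has the same shape as the frozen weight and can therefore be merged ahead of inference.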

Abstract

Pre-trained vision-language models have shown remarkable performance on visual recognition tasks, but they typically assume that complete multimodal inputs are available during both training and inference. In real-world scenarios, however, modalities may be missing due to privacy constraints, collection difficulties, or resource limitations. Previous approaches have addressed this challenge with prompt learning techniques, but they fail to capture the cross-modal relationships necessary for effective multimodal visual recognition and incur unavoidable computational overhead. In this paper, we introduce MoRA, a parameter-efficient fine-tuning method that explicitly models cross-modal interactions while maintaining modality-specific adaptations. MoRA introduces modality-common parameters between the text and vision encoders, enabling bidirectional knowledge transfer. Combined with the modality-specific parameters, these allow the backbone model to maintain inter-modality interaction while retaining intra-modality flexibility. Extensive experiments on standard benchmarks demonstrate that MoRA improves average performance in missing-modality scenarios by 5.24% and uses only 25.90% of the inference time of the SOTA method, while requiring only 0.11% of the trainable parameters of full fine-tuning.

Paper Structure

This paper contains 33 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Motivation for MoRA. (a) Performance comparison on MM-IMDb and Food101 datasets using unaligned vision and text encoders. (b) Performance comparison using aligned CLIP ViT-B/16 encoder. (c) During pre-training, modalities are aligned in embedding space with a gap; during fine-tuning, modalities should maintain their relationship while allowing modality-specific adaptations.
  • Figure 2: Overview of the proposed MoRA architecture.
  • Figure 3: Generalizability Analysis on Hateful Memes dataset. (a) Models are trained on missing-both or missing-text cases, and evaluated on missing-text cases. (b) Models are trained on missing-both or missing-image cases, and evaluated on missing-image cases. (c) All models are trained on missing-both cases, and evaluated on missing-both cases.
  • Figure 4: Comparison of eigenvalue distributions between Gram matrices and pre-trained weights.
  • Figure 5: Performance scaling of MoRA with different backbone models.
  • ...and 5 more figures