DeLo: Dual Decomposed Low-Rank Experts Collaboration for Continual Missing Modality Learning

Xiwei Liu, Yulong Li, Feilong Tang, Imran Razzak

TL;DR

DeLo is the first framework to use a dual-decomposed low-rank expert architecture for CMML. It resolves modality interference through decomposed LoRA experts, dynamically composing the LoRA update matrix from rank-one factors drawn from disentangled modality-specific factor pools.

Abstract

Adapting Large Multimodal Models (LMMs) to real-world scenarios poses the dual challenges of learning from sequential data streams while handling frequent modality incompleteness, a task known as Continual Missing Modality Learning (CMML). However, existing work on CMML has predominantly relied on prompt tuning, a technique that struggles with this task due to cross-task interference between its learnable prompts in their shared embedding space. A naive application of Low-Rank Adaptation (LoRA) with a modality-shared module also suffers from modality interference caused by competing gradients. To this end, we propose DeLo, the first framework to leverage a novel dual-decomposed low-rank expert architecture for CMML. Specifically, this architecture resolves modality interference through decomposed LoRA experts, dynamically composing the LoRA update matrix from rank-one factors drawn from disentangled modality-specific factor pools. Embedded within a task-partitioned framework that structurally prevents catastrophic forgetting, this expert system is supported by two key mechanisms: a Cross-Modal Guided Routing strategy to handle incomplete data and a Task-Key Memory for efficient, task-agnostic inference. Extensive experiments on established CMML benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches. This highlights the value of a principled, architecturally aware LoRA design for real-world multimodal challenges.

Paper Structure (19 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Illustration of continual missing modality learning.
  • Figure 2: Overview of our proposed DeLo framework. The left panel illustrates the task-partitioned architecture for continual learning. To prevent catastrophic forgetting, only the expert modules and classifier for the current task $k$ are trainable, while the main backbone and all modules for previous tasks (1 to $k-1$) are frozen. The right panel details the dual decomposed LoRA expert for a given Task $k$, which consists of Modality-Specific Factor Pools (Visual: $\mathcal{P}_k^{\mathrm{v}}=\{(a_e^{\mathrm{v}}, b_e^{\mathrm{v}})\}_{e=1}^E$ and Textual: $\mathcal{P}_k^{\mathrm{t}}=\{(a_e^{\mathrm{t}}, b_e^{\mathrm{t}})\}_{e=1}^E$). For each input, factors are dynamically selected from these pools to compose modality-specific weight adjustments $\Delta W^{\mathrm{v}} = \sum_{i=1}^{r} b_i^{\mathrm{v}} \otimes a_i^{\mathrm{v}}$ and $\Delta W^{\mathrm{t}} = \sum_{j=1}^{r} b_j^{\mathrm{t}} \otimes a_j^{\mathrm{t}}$, whose sum forms the final update $\Delta W_k$. The top of this panel shows our Cross-Modal Guided Routing for handling missing modalities, where the query from an available modality serves as a proxy for the missing one (e.g., $\hat{q}_{\mathrm{t}}:=q_{\mathrm{v}}$ for image-only input), together with the Alignment Loss $\mathcal{L}_{\mathrm{align}}$ used on modality-complete data.
  • Figure 3: Comparison of LoRA parameterizations. (a) Conventional LoRA, where the weight adjustment matrix is represented as a product of two low-rank matrices ($BA$). (b) Decomposed LoRA, where the adjustment is expressed as a sum of $r$ rank-one matrices, $\sum_{i=1}^r b_i \otimes a_i$, each formed by the outer product of a vector pair $(b_i, a_i)$.
  • Figure 4: (a) t-SNE visualization of the final representations learned by the Task-Key Memory for each of the 10 continual tasks in UPMC-Food101-CMML. (b) The distribution of cosine similarity between visual and textual query signals, evaluated on modality-complete data.
  • Figure 5: Comparison of selection frequency of each factor expert for (a) the vision pool and (b) the text pool.
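
The parameterizations described in the captions of Figures 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, and the scoring rule in `select_factors` (dot product between the query and each factor $a_e$, keeping the top $r$) is an assumption, since the captions say only that factors are "dynamically selected". The sketch shows that the conventional product $BA$ equals a sum of rank-one outer products, and how a missing modality's query can be replaced by the available one before composing $\Delta W_k = \Delta W^{\mathrm{v}} + \Delta W^{\mathrm{t}}$.

```python
# Illustrative sketch only: helper names and the scoring rule are assumptions,
# not the paper's implementation.

def outer(b, a):
    """Rank-one matrix b ⊗ a with shape len(b) x len(a)."""
    return [[bi * aj for aj in a] for bi in b]

def mat_add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def matmul(b_mat, a_mat):
    """Conventional LoRA update BA, with B of shape d_out x r and A of shape r x d_in."""
    r, d_in = len(a_mat), len(a_mat[0])
    return [[sum(row[k] * a_mat[k][j] for k in range(r)) for j in range(d_in)]
            for row in b_mat]

def rank_one_sum(pairs, d_out, d_in):
    """Decomposed LoRA update: sum of b_i ⊗ a_i over (a_i, b_i) pairs."""
    delta = [[0.0] * d_in for _ in range(d_out)]
    for a, b in pairs:
        delta = mat_add(delta, outer(b, a))
    return delta

def select_factors(pool, query, r):
    """Pick r factor indices from a pool of (a_e, b_e) pairs.

    ASSUMPTION: each factor is scored by the dot product of the query with a_e,
    and the r highest-scoring factors are kept.
    """
    scores = [sum(q * x for q, x in zip(query, a)) for a, _ in pool]
    return sorted(range(len(pool)), key=lambda e: scores[e], reverse=True)[:r]

def compose_delta_w(pool_v, pool_t, q_v, q_t, r):
    """Compose Delta W_k = Delta W^v + Delta W^t; None marks a missing modality."""
    if q_v is None and q_t is None:
        raise ValueError("at least one modality query is required")
    q_v = q_t if q_v is None else q_v  # text query stands in for a missing image
    q_t = q_v if q_t is None else q_t  # image query stands in for missing text
    d_out, d_in = len(pool_v[0][1]), len(pool_v[0][0])
    dv = rank_one_sum([pool_v[e] for e in select_factors(pool_v, q_v, r)], d_out, d_in)
    dt = rank_one_sum([pool_t[e] for e in select_factors(pool_t, q_t, r)], d_out, d_in)
    return mat_add(dv, dt)
```

Keeping the update as a sum of rank-one terms is what makes per-input composition possible: each $(a_e, b_e)$ pair can be included or left out independently, whereas a fixed $BA$ product commits to all $r$ directions at once. With an image-only input, calling `compose_delta_w(pool_v, pool_t, q_v, None, r)` routes both pools with the visual query, mirroring the $\hat{q}_{\mathrm{t}}:=q_{\mathrm{v}}$ proxy in the Figure 2 caption.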