Harmony: A Unified Framework for Modality Incremental Learning

Yaguang Song, Xiaoshan Yang, Dongmei Jiang, Yaowei Wang, Changsheng Xu

TL;DR

The paper tackles the challenge of Modality Incremental Learning (MIL), where a single model must continually learn across a sequence of distinct modalities using only unimodal data at each stage. It proposes Harmony, a transformer-based framework with Adaptive Compatible Feature Modulation to generate compatible historical features and Cumulative Modal Bridging to fuse historical knowledge with current learning via modality knowledge aggregation and a three-part Hybrid Alignment. The approach demonstrates superior performance on two MIL benchmarks, EPIC-MIL and Drive&Act-MIL, over a range of incremental-learning baselines, highlighting its ability to bridge modality gaps while preserving prior knowledge. By enabling effective modal connections and knowledge accumulation under data-restricted conditions, Harmony advances practical MIL and opens avenues for adding more modalities in open-world settings.

Abstract

Incremental learning aims to enable models to continuously acquire knowledge from evolving data streams while preserving previously learned capabilities. While current research predominantly focuses on unimodal incremental learning and multimodal incremental learning where the modalities are consistent, real-world scenarios often present data from entirely new modalities, posing additional challenges. This paper investigates the feasibility of developing a unified model capable of incremental learning across continuously evolving modal sequences. To this end, we introduce a novel paradigm called Modality Incremental Learning (MIL), where each learning stage involves data from distinct modalities. To address this task, we propose a novel framework named Harmony, designed to achieve modal alignment and knowledge retention, enabling the model to reduce the modal discrepancy and learn from a sequence of distinct modalities, ultimately completing tasks across multiple modalities within a unified framework. Our approach introduces adaptive compatible feature modulation and cumulative modal bridging. By constructing historical modal features and performing modal knowledge accumulation and alignment, these components work in concert to bridge modal differences, establish effective modality connections, and maintain knowledge retention, even when only unimodal data is available at each learning stage. Extensive experiments on the MIL task demonstrate that our proposed method significantly outperforms existing incremental learning methods, validating its effectiveness in MIL scenarios.
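
The MIL setting described above (one modality per learning stage, with only that modality's data available) can be illustrated with a minimal sketch of the training protocol. This is not the Harmony method itself: the function `run_modality_incremental`, the `RecordingLearner` stand-in, the toy data, and the RGB, Flow, Audio ordering are all hypothetical names and assumptions chosen for illustration, and the placeholder metric is not a real accuracy.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature vector, class label)


def run_modality_incremental(
    modality_order: Sequence[str],
    train_sets: Dict[str, List[Sample]],
    train_phase: Callable[[str, List[Sample]], None],
    evaluate: Callable[[str], float],
) -> List[Dict[str, float]]:
    """Train one modality per phase, then evaluate on every modality seen so far."""
    history: List[Dict[str, float]] = []
    seen: List[str] = []
    for modality in modality_order:
        # Only the current modality's data is passed in; data from
        # earlier phases is not revisited (the unimodal-data constraint).
        train_phase(modality, train_sets[modality])
        seen.append(modality)
        # A single model is expected to handle all modalities learned so far.
        history.append({m: evaluate(m) for m in seen})
    return history


class RecordingLearner:
    """Hypothetical stand-in for a real model: it only records what it was shown."""

    def __init__(self) -> None:
        self.phases: List[Tuple[str, int]] = []

    def train_phase(self, modality: str, samples: List[Sample]) -> None:
        self.phases.append((modality, len(samples)))

    def evaluate(self, modality: str) -> float:
        # Placeholder metric (assumption): 1.0 if the modality was trained on.
        return 1.0 if any(m == modality for m, _ in self.phases) else 0.0


# Toy data: two samples per modality; the ordering is an assumption.
order = ["RGB", "Flow", "Audio"]
train_sets = {m: [([0.0, 1.0], 0), ([1.0, 0.0], 1)] for m in order}

learner = RecordingLearner()
history = run_modality_incremental(
    order, train_sets, learner.train_phase, learner.evaluate
)
```

In this sketch, `history[k]` holds one score per modality seen up to phase `k`, which is the kind of record needed to check whether earlier modalities are retained as new ones are added.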

Paper Structure

This paper contains 20 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of the (c) Modality Incremental Learning (MIL) with conventional (a) Unimodal Incremental Learning and (b) Multimodal Incremental Learning.
  • Figure 2: Overview of the proposed framework for MIL. At the current phase $t$, we first obtain the modulated feature $F^{t-1}_{i}$ through the adaptive compatible feature modulation for modality $t-1$. Then we integrate the historical modal knowledge through the cumulative knowledge aggregation and achieve modality connection with hybrid alignment $\mathcal{L}_{align}$.
  • Figure 3: (a) Raw input features $F_{i}^{t}$ of RGB, Flow, and Audio. (b) Features $\hat{F}_{i}^{t}$ processed by the modality knowledge aggregation module.
  • Figure 4: Analysis of the hyperparameter $\lambda_{g}$, which controls the intensity of feature perturbation in the feature modulation.
  • Figure 5: Analysis of the hyperparameter $\lambda$, a trade-off weight for balancing the different losses.