Enhancing multimodal cooperation via sample-level modality valuation

Yake Wei, Ruoxuan Feng, Zihe Wang, Di Hu

TL;DR

A sample-level modality valuation metric is introduced to evaluate the contribution of each modality for each sample. Cooperation between modalities is then improved at the sample level by enhancing the discriminative ability of low-contributing modalities in a targeted manner.

Abstract

One primary topic of multimodal learning is how to jointly incorporate heterogeneous information from different modalities. However, most models suffer from unsatisfactory multimodal cooperation and cannot jointly utilize all modalities well. Some methods have been proposed to identify and enhance the worse-learnt modality, but they often fail to provide fine-grained observation of multimodal cooperation at the sample level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation, we observe that modality discrepancy can indeed differ at the sample level, beyond the global contribution discrepancy at the dataset level. We further analyze this issue and improve cooperation between modalities at the sample level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution and achieve considerable improvement. The source code and dataset are available at https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation.
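
The abstract does not spell out how the per-sample contribution is computed. As a minimal sketch, assuming a Shapley-value style attribution over subsets of modalities (an assumption here, suggested only by the figure label fig:shap-dataset), the contribution of each modality for a single sample could be computed roughly as follows. The function name shapley_contributions, the toy value function, and the correctness table are hypothetical illustrations and are not taken from the authors' code.

```python
from itertools import combinations
from math import factorial


def shapley_contributions(modalities, value_fn):
    """Shapley value of each modality for ONE sample.

    value_fn maps a frozenset of modality names to a real-valued payoff
    (hypothetical: e.g. 1.0 if the model predicts the sample correctly
    when only those modalities are present, else 0.0).
    """
    n = len(modalities)
    contributions = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        # Average the marginal gain of adding m over all subsets of the others.
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {m}) - value_fn(s))
        contributions[m] = total
    return contributions


# Toy sample (assumed): audio alone is enough to classify it, visual alone is not.
correct = {
    frozenset(): False,
    frozenset({"audio"}): True,
    frozenset({"visual"}): False,
    frozenset({"audio", "visual"}): True,
}


def toy_value(subset):
    # Payoff is 1.0 when the (hypothetical) model is correct with this subset.
    return 1.0 if correct[subset] else 0.0


contributions = shapley_contributions(["audio", "visual"], toy_value)
print(contributions)
```

In this toy sample the audio modality receives all of the credit and the visual modality receives none, so visual would be the low-contributing modality that a targeted enhancement step focuses on for this sample. A different sample could yield the opposite ranking, which is the sample-level variation the abstract describes.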

Paper Structure (17 sections, 8 equations, 6 figures, 4 tables, 2 algorithms)

Figures (6)

  • Figure 1: Accuracy improvement of imbalanced multimodal learning methods over the joint training baseline, on Kinetics Sounds and our proposed MM-Debiased dataset. Other methods: OGM-GE (peng2022balanced), Greedy (wu2022characterizing) and PMR (fan2023pmr).
  • Figure 2: (a-b): Audio-visual samples of the motorcycling category. (c): Our modality valuation of S.1 and S.2, where S.1 and S.2 denote Sample 1 and Sample 2 respectively. (d): Uni-modal average contribution over all training samples of different datasets. Our proposed MM-Debiased dataset has less global discrepancy at the dataset level than other curated datasets.
  • Figure 3: Average contribution of each modality over all training samples during training for OGM-GE, Greedy, G-Blending and our methods on the UCF-101 dataset.
  • Figure 4: Valuation of two samples of the motorcycling category.
  • Figure 5: Visual feature distribution of Concatenation, MMTM, MMTM-Sample and MMTM-Modality, visualized by t-SNE (van2008visualizing) on the Kinetics Sounds dataset. As the dataset-level contribution figure (label fig:shap-dataset) shows, the visual modality tends to be the low-contributing one. Categories are indicated in different colors.
  • ...and 1 more figures

Theorems & Definitions (2)

  • Remark 1
  • Remark 2