Interventional Imbalanced Multi-Modal Representation Learning via $β$-Generalization Front-Door Criterion
Yi Li, Fei Song, Changwen Zheng, Jiangmeng Li, Fuchun Sun, Hui Xiong
TL;DR
The paper tackles the problem of imbalanced modality contributions in multi-modal learning by introducing a causal view and a novel Interventional Imbalanced Multi-Modal Learning (IMML) framework. IMML has two central components: (1) a modality discriminative knowledge exploration network that learns and aligns per-modality discriminative features using a contrastive loss, and (2) a $β$-generalization front-door adjustment that estimates the causal effect of the predominant modality on the target while incorporating the auxiliary modality through a learned front-door mechanism. The authors derive the $β$-generalization front-door criterion, provide a dual-perspective derivation and a formal identifiability formula, and prove a generalization bound showing that minimizing the discriminative knowledge loss improves out-of-distribution performance. Empirically, IMML yields significant improvements over strong baselines across multiple datasets and remains competitive with, or superior to, state-of-the-art plug-and-play methods, supporting both its theoretical foundations and its practical value for robust, causally informed multi-modal learning.
Abstract
Multi-modal methods have demonstrated broad superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions consistently degrade the discriminative performance of canonical multi-modal methods. Based on their contribution to task-dependent predictions, modalities can be categorized as predominant or auxiliary. Benchmark methods propose a tractable solution: augmenting the auxiliary modality, whose contribution is minor, during training. However, our empirical explorations challenge the fundamental idea behind this strategy, and we further conclude that benchmark approaches suffer from two defects: insufficient theoretical interpretability and limited capability to explore discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build a Structural Causal Model. Guided by the empirical explorations, we aim to capture the true causal relationship between the discriminative knowledge of the predominant modality and the predictive label while accounting for the auxiliary modality. Thus, we introduce the $β$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations support the effectiveness of the mechanism underlying our proposed method.
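For orientation, the classical front-door adjustment (not the $β$-generalized criterion introduced in the paper, whose exact form is not stated in this summary) identifies an interventional effect from observational quantities as P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | x', m) P(x'). The sketch below checks this identity on a toy binary model. The graph (a hidden confounder U of X and Y, with a mediator M between X and Y), all probability values, and all function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Classical front-door adjustment on a toy binary model (illustrative only).
# Assumed graph: U -> X, U -> Y (U is hidden), X -> M -> Y.
# All numbers below are hypothetical.

P_U = [0.6, 0.4]               # P(U=u)
P_X1_GIVEN_U = [0.2, 0.8]      # P(X=1 | U=u)
P_M1_GIVEN_X = [0.3, 0.9]      # P(M=1 | X=x)
P_Y1_GIVEN_MU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | M=m, U=u)


def bern(p1, value):
    """Probability that a Bernoulli(p1) variable equals `value`."""
    return p1 if value == 1 else 1.0 - p1


def p_x(x):
    """Observational marginal P(X=x)."""
    return sum(P_U[u] * bern(P_X1_GIVEN_U[u], x) for u in (0, 1))


def p_y1_given_xm(x, m):
    """Observational P(Y=1 | X=x, M=m), obtained by marginalising the hidden U."""
    return sum(
        P_U[u] * bern(P_X1_GIVEN_U[u], x) / p_x(x) * P_Y1_GIVEN_MU[(m, u)]
        for u in (0, 1)
    )


def front_door(x):
    """P(Y=1 | do(X=x)) computed from observational quantities only."""
    return sum(
        bern(P_M1_GIVEN_X[x], m)
        * sum(p_y1_given_xm(x2, m) * p_x(x2) for x2 in (0, 1))
        for m in (0, 1)
    )


def ground_truth(x):
    """P(Y=1 | do(X=x)) computed with direct access to the hidden U."""
    return sum(
        bern(P_M1_GIVEN_X[x], m) * P_U[u] * P_Y1_GIVEN_MU[(m, u)]
        for m in (0, 1)
        for u in (0, 1)
    )


def naive(x):
    """Confounded observational P(Y=1 | X=x), shown for comparison."""
    return sum(bern(P_M1_GIVEN_X[x], m) * p_y1_given_xm(x, m) for m in (0, 1))
```

In this toy setting, `front_door(x)` matches `ground_truth(x)` even though it never touches U, whereas `naive(x)` is biased by the confounder. This is the property that makes front-door-style adjustments attractive when the confounding between an input and a label cannot be observed directly.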
