Debiased Multimodal Understanding for Human Language Sequences
Zhi Xu, Dingkang Yang, Mingcheng Li, Yuzheng Wang, Zhaoyu Chen, Jiawei Chen, Jinjie Wei, Lihua Zhang
TL;DR
This work identifies subject variation as a key confounder that limits generalization in multimodal language understanding (MLU). It frames MLU within a causal graph and introduces SuCI, a plug-in module that uses backdoor adjustment and an NWGM-inspired (normalized weighted geometric mean) intervention to estimate $P(Y|do(X))$, thereby removing subject-specific spurious correlations. SuCI combines dynamic multimodal fusion, a subject feature generator, and a stratified confounder dictionary to construct confounders and perform calibrated intervention during training. On MOSI, MOSEI, and UR_FUNNY, as well as in cross-dataset evaluations, SuCI consistently improves the performance of diverse baselines, indicating better robustness and generalization; extensive ablations and qualitative evidence support the debiasing effect. The approach offers a broadly applicable, model-agnostic route to debiased MLU and could inform causal debiasing in other multimodal tasks.
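For reference, the backdoor adjustment named in the TL;DR has a standard textbook form, sketched here under assumptions rather than as the paper's exact formulation. The symbol $Z$ for the subject confounder, its finite set of strata $z$ (for example, the entries of a confounder dictionary), and the logit function $f$ are notation introduced here for illustration only.

$$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)$$

Assuming the prediction is a softmax over logits $f(X, z)$, the NWGM approximation moves the expectation over strata inside the softmax, so a single forward pass suffices instead of one pass per stratum:

$$P(Y \mid do(X)) \approx \mathrm{softmax}\Big(\sum_{z} f(X, z)\, P(Z = z)\Big)$$

The key difference from the observational $P(Y \mid X)$ is that each stratum is weighted by its prior $P(Z = z)$ rather than by $P(Z = z \mid X)$, which is what prevents any one subject's expression habits from dominating the prediction.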
Abstract
Human multimodal language understanding (MLU) is an indispensable component of expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic content, and acoustic behaviours. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem caused by data distribution discrepancies among subjects. Concretely, because subjects in the training data differ in their expression customs and characteristics, MLU models are easily misled into learning subject-specific spurious correlations, which limits performance and generalizability to new subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MLU procedure and analyze the confounding effect of subjects. We then propose SuCI, a simple yet effective causal intervention module that disentangles the impact of subjects acting as unobserved confounders, so that models are trained on true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module.
