Debiased Multimodal Understanding for Human Language Sequences

Zhi Xu; Dingkang Yang; Mingcheng Li; Yuzheng Wang; Zhaoyu Chen; Jiawei Chen; Jinjie Wei; Lihua Zhang

TL;DR

This work identifies subject variation as a key confounder that limits generalization in multimodal language understanding (MLU). It frames MLU within a causal graph and introduces SuCI, a plug-in module that uses backdoor adjustment and an NWGM-inspired intervention to estimate $P(Y|do(X))$, thereby removing subject-specific spurious correlations. SuCI combines dynamic multimodal fusion, a subject feature generator, and a stratified confounder dictionary to construct confounders and perform calibrated intervention during training. Across MOSI, MOSEI, and UR_FUNNY, as well as in cross-dataset evaluations, SuCI consistently improves the performance of diverse baselines, with ablations and qualitative evidence supporting the debiasing effect. The approach offers a model-agnostic route to debiased MLU and could inform causal debiasing in other multimodal tasks.
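
For readers unfamiliar with the notation, the textbook form of backdoor adjustment over a discrete confounder can be written as follows. The symbols $Z$ (subject confounder), $z$ (one stratum), and $f$ (the pre-softmax logit function) are assumptions introduced here for illustration; the paper's own equations may use different notation or a different approximation.

$$
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)
$$

When the classifier is a softmax over logits $f(x, z)$, the NWGM (Normalized Weighted Geometric Mean) approximation is commonly used to move the expectation inside the softmax, so that a single forward pass suffices:

$$
\mathbb{E}_{z}\big[\mathrm{Softmax}(f(x, z))\big] \approx \mathrm{Softmax}\big(\mathbb{E}_{z}[f(x, z)]\big)
$$

The key difference from the ordinary likelihood $P(Y \mid X)$ is the weighting: each stratum is weighted by its prior $P(Z = z)$ rather than by $P(Z = z \mid X)$, which cuts the dependence of the confounder on the input.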

Abstract

Human multimodal language understanding (MLU) is an indispensable component of expression analysis (e.g., sentiment or humor) from heterogeneous modalities, including visual postures, linguistic contents, and acoustic behaviours. Existing works invariably focus on designing sophisticated structures or fusion strategies to achieve impressive improvements. Unfortunately, they all suffer from the subject variation problem due to data distribution discrepancies among subjects. Concretely, MLU models are easily misled by distinct subjects with different expression customs and characteristics in the training data to learn subject-specific spurious correlations, limiting performance and generalizability across new subjects. Motivated by this observation, we introduce a recapitulative causal graph to formulate the MLU procedure and analyze the confounding effect of subjects. Then, we propose SuCI, a simple yet effective causal intervention module to disentangle the impact of subjects acting as unobserved confounders and achieve model training via true causal effects. As a plug-and-play component, SuCI can be widely applied to most methods that seek unbiased predictions. Comprehensive experiments on several MLU benchmarks clearly show the effectiveness of the proposed module.
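
To make the confounding argument concrete, the following minimal sketch compares the observational estimate $P(Y|X)$ with the backdoor-adjusted estimate $P(Y|do(X))$ on a toy discrete problem. All numbers and the names `P_Z`, `P_X1_GIVEN_Z`, `P_Y1_GIVEN_XZ`, `observational`, and `interventional` are hypothetical and chosen only for illustration; this is not the SuCI module, which operates on learned features rather than probability tables.

```python
# Hypothetical toy numbers, not taken from the paper.
# Z is a binary "subject group" confounder that influences both X and Y.
P_Z = [0.5, 0.5]              # P(Z=z)
P_X1_GIVEN_Z = [0.2, 0.8]     # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = [[0.3, 0.7],  # P(Y=1 | X=0, Z=z)
                 [0.3, 0.7]]  # P(Y=1 | X=1, Z=z); Y depends only on Z here


def observational(x):
    """P(Y=1 | X=x): strata are weighted by the posterior P(z | x)."""
    p_x_given_z = [p if x == 1 else 1.0 - p for p in P_X1_GIVEN_Z]
    joint = [pxz * pz for pxz, pz in zip(p_x_given_z, P_Z)]
    total = sum(joint)
    p_z_given_x = [j / total for j in joint]
    return sum(py * pz for py, pz in zip(P_Y1_GIVEN_XZ[x], p_z_given_x))


def interventional(x):
    """P(Y=1 | do(X=x)) via backdoor adjustment: strata weighted by P(z)."""
    return sum(py * pz for py, pz in zip(P_Y1_GIVEN_XZ[x], P_Z))


if __name__ == "__main__":
    for x in (0, 1):
        print(x, round(observational(x), 4), round(interventional(x), 4))
```

In this toy setup X has no causal effect on Y, yet the observational estimate differs between `x = 0` and `x = 1` because the subject group drives both variables. The adjusted estimate is identical for both inputs, which is the kind of spurious correlation the abstract describes.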

Paper Structure (15 sections, 9 equations, 5 figures, 6 tables)

Introduction
Related Work
Methodology
Structural Causal Graph in MLU Tasks
Causal Intervention via Backdoor Adjustment
Subject De-confounded Training with SuCI
Experiments
Benchmarks and Model Zoo
Implementation Details
Comparison with State-of-the-art Methods
Cross-dataset Evaluation
Ablation Studies
Qualitative Evaluation
Conclusion
Acknowledgments

Figures (5)

Figure 1: Examples on the MOSI benchmark illustrate the subject variation problem. Multimodal expressions from four subjects potentially convey distinct semantic correlations due to their different customs and styles in expressing sentiments.
Figure 2: The causal graph explains the causal effects in the MLU procedure. Nodes denote variables and arrows denote direct causal effects. (a) The conventional likelihood estimation $P(\bm{Y}|\bm{X})$. (b) The causal intervention $P(\bm{Y}|do(\bm{X}))$.
Figure 3: A general MLU pipeline for the subject de-confounded training. The red dotted box shows the core component that achieves the approximation to causal intervention: our SuCI. SuCI can be readily integrated into the vanilla MLU model via backdoor adjustment to mitigate subject-specific spurious correlations and achieve debiased predictions in downstream tasks.
Figure 4: Ablation study results for the number of subject confounders on the UR_FUNNY benchmark.
Figure 5: Quantitative results (i.e., binary or seven-class classification) of vanilla and SuCI-based DMD on the MOSEI benchmark.
