Invariant Representation Guided Multimodal Sentiment Decoding with Sequential Variation Regularization
Guoyang Xu, Zhenxi Song, Junqi Xue, Yuxin Liu, Zirui Wang, Zhiguo Zhang
TL;DR
This work tackles robust multimodal sentiment analysis by addressing both cross-modal alignment and temporal stability. It introduces a dual strategy: (1) adversarial modality disentanglement to learn invariant $I_i$ and modality-specific $S_i$ representations and fuse them through an invariant-guided fusion module, and (2) sequential variation regularization to stabilize temporal dynamics, using a temporal invariant loss $L_{ti}$ derived from frame-to-frame divergence. The approach combines a CMD-based consistency objective, gradient-reversal adversarial learning, and a gating mechanism via Factorized Bilinear Pooling, achieving state-of-the-art results on CMU-MOSI, CMU-MOSEI, and UR_FUNNY with demonstrated robustness to noise and rapid emotional fluctuations. These findings suggest that jointly optimizing cross-modal invariance and temporal smoothness yields more reliable sentiment decoding in realistic, noisy settings.
Abstract
Achieving consistent sentiment representation across diverse modalities remains a key challenge in multimodal sentiment analysis. However, rapid emotional fluctuations over time often introduce instability, leading to compromised prediction performance. To address this challenge, we propose a robust sentiment representation dual enhancement strategy that simultaneously enhances the temporal and modality dimensions, guided by targeted mechanisms in both forward and backward propagation. Specifically, in the modality dimension, we introduce a modality invariant fusion mechanism that fosters stable cross-modal representations, which aim to capture the common and stable representations shared across different modalities. In the temporal dimension, we impose a specialized sequential variation regularization term that regulates the model's learning trajectory during backward propagation, which is essentially total variation regularization degenerated into one-dimensional linear differences. Extensive experiments on three standard public datasets validate the effectiveness of our proposed approach.
