DCER: Dual-Stage Compression and Energy-Based Reconstruction
Yiwen Wang, Jiahao Qin
TL;DR
DCER targets robustness in multimodal sentiment analysis under noisy and missing inputs through dual-stage compression and energy-based reconstruction. The compression stage first applies within-modality frequency-domain transforms (wavelets at $L=3$ for audio, a four-band DCT for video) and then passes the modalities through a cross-modality bottleneck that enforces genuine integration. The reconstruction stage recovers missing modalities through an energy-based inference process whose final energy serves as an uncertainty estimate. The approach yields state-of-the-art results on CMU-MOSI, CMU-MOSEI, and CH-SIMS and exhibits a distinctive U-shaped robustness pattern: multimodal fusion is favored both when all modalities are present and when the missing rate is high. The energy-based uncertainty also enables selective rejection of uncertain predictions. The work highlights the value of exploiting known signal structure before fusion and suggests applicability to other multimodal tasks that face missing data and noise.
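To make the within-modality step concrete, the following is a minimal sketch of a three-level wavelet decomposition followed by shrinkage of the detail coefficients. Only the decomposition depth ($L=3$) comes from the summary above. The Haar wavelet, the soft-thresholding rule, the threshold value, and the function names are assumptions chosen for illustration and are not taken from the DCER implementation.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar transform (even-length input)."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def wavelet_denoise(x, levels=3, threshold=0.5):
    """Decompose to `levels` levels, shrink details, then reconstruct.

    Soft-thresholding is an assumed stand-in for the paper's compression rule.
    """
    approx = np.asarray(x, dtype=float)
    if approx.size % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    # Shrink detail coefficients toward zero; small ones are treated as noise.
    details = [np.sign(d) * np.maximum(np.abs(d) - threshold, 0.0) for d in details]
    # Invert from the coarsest level back to the original resolution.
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx

# Usage: a piecewise-constant toy "audio feature" corrupted by Gaussian noise.
clean = np.repeat([1.0, -2.0, 0.5, 3.0], 8)
noisy = clean + 0.2 * np.random.default_rng(0).standard_normal(clean.size)
denoised = wavelet_denoise(noisy, levels=3, threshold=0.5)
```

Because the transform is orthonormal, shrinking detail coefficients can only reduce the error on a signal whose clean detail coefficients are zero, which is why this toy signal was chosen. Real audio features would need a wavelet and threshold tuned to the data.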
Abstract
Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework that addresses both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, and the final energy provides intrinsic uncertainty quantification ($\rho > 0.72$ correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance on all three benchmarks, with a U-shaped robustness pattern that favors multimodal fusion both when all modalities are present and when the missing rate is high. The code will be made available on GitHub.
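The reconstruction idea can be illustrated with a small sketch: a missing representation is recovered by gradient descent on an energy function conditioned on the observed modality, and the energy at the final iterate is returned as an uncertainty score. The quadratic energy, the matrix `A`, the regularizer `lam`, and the step-size rule below are hypothetical stand-ins for illustration. The abstract states only that the energy function is learned and does not specify its form.

```python
import numpy as np

def energy(z, context, A, lam):
    """Toy energy: how poorly candidate z explains the observed context."""
    residual = A @ z - context
    return 0.5 * residual @ residual + 0.5 * lam * z @ z

def energy_grad(z, context, A, lam):
    """Gradient of the toy energy with respect to z."""
    return A.T @ (A @ z - context) + lam * z

def reconstruct(context, A, lam=0.1, steps=200, lr=None):
    """Recover a missing representation by gradient descent on the energy.

    Returns the reconstruction and its final energy (used as uncertainty).
    """
    if lr is None:
        # 1/L step size, where L is the Lipschitz constant of the gradient.
        lr = 1.0 / (np.linalg.norm(A, 2) ** 2 + lam)
    z = np.zeros(A.shape[1])
    for _ in range(steps):
        z = z - lr * energy_grad(z, context, A, lam)
    return z, energy(z, context, A, lam)

# Usage: `A` plays the role of a learned map from the missing modality's
# representation to the observed modality's features (an assumption here).
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
observed = rng.standard_normal(8)
z_hat, final_energy = reconstruct(observed, A)
```

In this sketch a high final energy means that no candidate representation explains the observed features well, which is the sense in which the energy could flag predictions for selective rejection. A learned, non-quadratic energy would need automatic differentiation in place of the hand-written gradient.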
