DCER: Dual-Stage Compression and Energy-Based Reconstruction

Yiwen Wang, Jiahao Qin

TL;DR

DCER targets robustness in multimodal sentiment analysis under noisy and missing inputs by combining dual-stage compression with energy-based reconstruction. The first stage applies within-modality frequency-domain compression (wavelets at $L=3$ for audio, a four-band DCT for video) and adds a cross-modality bottleneck to enforce genuine integration. The second stage reconstructs missing modalities through an energy-based inference process that also provides calibrated uncertainty. The approach reports state-of-the-art results on CMU-MOSI, CMU-MOSEI, and CH-SIMS and shows a distinctive U-shaped robustness pattern: multimodal fusion is favored both when all modalities are present and when the missing rate is high. The energy-based uncertainty additionally allows uncertain predictions to be rejected selectively. The work highlights the value of exploiting known signal structure before fusion and suggests broader applicability to other multimodal tasks that face missing data and noise.
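
As a rough illustration of the within-modality compression step for audio, the sketch below runs a three-level wavelet decomposition ($L=3$), shrinks the detail coefficients, and reconstructs the signal. The Haar wavelet, the soft-thresholding rule, and the threshold `tau` are assumptions made for this example only; the summary above does not state which wavelet family or shrinkage rule DCER uses.

```python
import math

SQRT_HALF = 1.0 / math.sqrt(2.0)


def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) * SQRT_HALF for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SQRT_HALF for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level, interleaving the two recovered samples per pair."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * SQRT_HALF, (a - d) * SQRT_HALF])
    return out


def soft_threshold(coeffs, tau):
    """Shrink each coefficient toward zero by tau (assumed shrinkage rule)."""
    return [math.copysign(max(abs(c) - tau, 0.0), c) for c in coeffs]


def wavelet_denoise(signal, levels=3, tau=0.5):
    """Decompose to `levels` levels, shrink the details, and reconstruct."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(soft_threshold(detail, tau))
    # Rebuild from the coarsest level back to the original resolution.
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


if __name__ == "__main__":
    noisy = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0, 4.0, 4.0]
    print(wavelet_denoise(noisy, levels=3, tau=0.5))
```

With `tau=0` the transform is lossless and returns the input. With a very large `tau` every detail coefficient is zeroed, so each block of $2^3 = 8$ samples collapses to its mean, which is the most aggressive compression this sketch allows.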

Abstract

Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification ($\rho > 0.72$ correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on GitHub.
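
The reconstruction idea in the abstract (gradient descent on an energy function, with the final energy read as an uncertainty score) can be sketched with a toy example. Everything specific here is an assumption for illustration: the quadratic energy, the names `pred` (a hypothetical cross-modal prediction of the missing representation) and `prior` (a hypothetical prototype for that modality), the weight `lam`, and the rejection threshold. In DCER the energy function is learned; this hand-written one only shows the mechanics.

```python
def energy(z, pred, prior, lam):
    """Toy energy: agreement with the cross-modal prediction plus a prior term."""
    fit = sum((zi - pi) ** 2 for zi, pi in zip(z, pred))
    reg = sum((zi - mi) ** 2 for zi, mi in zip(z, prior))
    return 0.5 * fit + 0.5 * lam * reg


def energy_grad(z, pred, prior, lam):
    """Analytic gradient of the toy energy with respect to z."""
    return [(zi - pi) + lam * (zi - mi) for zi, pi, mi in zip(z, pred, prior)]


def reconstruct(pred, prior, lam=0.5, step=0.2, n_steps=100):
    """Recover a missing representation by gradient descent on the energy.

    Returns the reconstructed vector and its final energy, which is used
    below as an uncertainty score.
    """
    z = [0.0] * len(pred)
    for _ in range(n_steps):
        grad = energy_grad(z, pred, prior, lam)
        z = [zi - step * gi for zi, gi in zip(z, grad)]
    return z, energy(z, pred, prior, lam)


def should_reject(final_energy, threshold=1.0):
    """Selective rejection: abstain when the final energy is too high."""
    return final_energy > threshold


if __name__ == "__main__":
    z, e = reconstruct(pred=[3.0, 0.0], prior=[0.0, 0.0])
    print(z, e, should_reject(e))
```

In this toy setting the final energy grows with the disagreement between `pred` and `prior`, so a high value flags a reconstruction that the two sources of evidence do not agree on. That is the sense in which the final energy can drive selective rejection.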

Paper Structure

This paper contains 17 sections, 6 equations, 2 figures, 8 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Motivation and overview. (a) Standard multimodal fusion lacks noise filtering and learns modality-specific shortcuts, causing prediction failures when modalities are missing. (b) DCER addresses these issues through three stages: within-modality frequency compression removes noise, cross-modal bottleneck tokens force genuine integration, and energy-based reconstruction recovers missing modalities with calibrated uncertainty ($\rho > 0.72$).
  • Figure 2: DCER Architecture. Left: Modality-specific encoders apply frequency transforms (wavelet for audio, DCT for video) for within-modality compression. Center: Learnable bottleneck tokens attend to all modalities via cross-attention, implementing the cross-modality bottleneck. Right: Energy-based reconstruction handles missing modalities and provides uncertainty quantification.
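
The bottleneck in the center panel of Figure 2 can be pictured as a small, fixed set of tokens that query the tokens of every modality at once. The sketch below is a deliberately minimal version: single-head scaled dot-product attention with identity projections and hand-set token values. The function names and the dictionary layout are hypothetical, and the actual model uses learnable tokens and learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention (identity projections assumed)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[i] for w, kv in zip(weights, keys_values)) for i in range(dim)]
        )
    return outputs


def bottleneck_fuse(bottleneck_tokens, modality_tokens):
    """Let the bottleneck tokens attend over the tokens of all modalities.

    modality_tokens maps a modality name to a list of token vectors.
    """
    all_tokens = [tok for toks in modality_tokens.values() for tok in toks]
    return cross_attend(bottleneck_tokens, all_tokens)


if __name__ == "__main__":
    modalities = {
        "text": [[1.0, 0.0], [0.0, 1.0]],
        "audio": [[1.0, 1.0]],
        "video": [[2.0, 2.0]],
    }
    fused = bottleneck_fuse([[0.0, 0.0], [0.5, 0.5]], modalities)
    print(fused)
```

The output always has as many vectors as there are bottleneck tokens, regardless of how many tokens each modality contributes. Downstream layers therefore see only this narrow fused summary, which is the property that discourages modality-specific shortcuts.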