MoLAN: A Unified Modality-Aware Noise Dynamic Editing Framework for Multimodal Sentiment Analysis

Xingle Xu, Yongkang Liu, Dexian Cai, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang

TL;DR

MoLAN addresses noise in multimodal sentiment analysis by introducing a block-based, modality-aware denoising framework that partitions each modality into sub-blocks and assigns dynamic denoising strengths. It is plug-and-play across MSA models and Multimodal Large Language Models (MLLMs), with MoLAN+ adding noise-suppressed cross-attention and denoising-driven contrastive learning to sharpen cross-modal alignment. Experiments on four datasets across five models show broad improvements and state-of-the-art performance, supporting the approach's robustness to modality noise. The work offers a scalable, practical path toward robust, high-quality multimodal representations for sentiment analysis in real-world noisy environments.

Abstract

Multimodal Sentiment Analysis aims to integrate information from various modalities, such as audio, visual, and text, to make complementary predictions. However, it often struggles with irrelevant or misleading visual and auditory information. Most existing approaches treat all the information in a modality (e.g., a whole image, audio segment, or text paragraph) as a single unit for feature enhancement or denoising. As a result, they often suppress redundant and noisy information at the risk of losing critical information. To address this challenge, we propose MoLAN, a unified ModaLity-aware noise dynAmic editiNg framework. Specifically, MoLAN performs modality-aware blocking by dividing the features of each modality into multiple blocks. Each block is then dynamically assigned a distinct denoising strength based on its noise level and semantic relevance, enabling fine-grained noise suppression while preserving essential multimodal information. Notably, MoLAN is a unified and flexible framework that can be seamlessly integrated into a wide range of multimodal models. Building upon this framework, we further introduce MoLAN+, a new multimodal sentiment analysis approach. Experiments across five models and four datasets demonstrate the broad effectiveness of the MoLAN framework. Extensive evaluations show that MoLAN+ achieves state-of-the-art performance. The code is publicly available at https://github.com/betterfly123/MoLAN-Framework.
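
The block-wise idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the scoring rule, so the sketch assumes that a block's "semantic relevance" is the cosine similarity between the block's mean feature and a reference vector (for example, a pooled text feature), and that the denoising strength is a simple linear function of that similarity. The names `split_into_blocks`, `block_denoise`, `reference`, and `num_blocks` are hypothetical and chosen only for illustration.

```python
import numpy as np


def split_into_blocks(features, num_blocks):
    """Split a (seq_len, dim) feature matrix into contiguous blocks along the sequence axis."""
    return np.array_split(features, num_blocks, axis=0)


def block_denoise(features, reference, num_blocks=4):
    """Attenuate each block by its own denoising strength.

    Assumption (not from the paper): strength = (1 - cos) / 2, where cos is the
    cosine similarity between the block's mean feature and `reference`.
    Blocks aligned with the reference are kept; dissimilar blocks are damped.
    """
    eps = 1e-8
    ref_unit = reference / (np.linalg.norm(reference) + eps)
    denoised_blocks = []
    strengths = []
    for block in split_into_blocks(features, num_blocks):
        summary = block.mean(axis=0)
        cos = float(summary @ ref_unit / (np.linalg.norm(summary) + eps))
        cos = float(np.clip(cos, -1.0, 1.0))
        strength = (1.0 - cos) / 2.0  # in [0, 1]; higher means stronger suppression
        denoised_blocks.append((1.0 - strength) * block)
        strengths.append(strength)
    return np.concatenate(denoised_blocks, axis=0), np.array(strengths)


# Toy usage: 8 visual frames with 16-dimensional features and a pooled text vector.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
text_summary = rng.normal(size=16)
denoised, per_block_strength = block_denoise(visual, text_summary, num_blocks=4)
```

The point of the sketch is the granularity: each block receives its own strength, so a noisy block can be damped without scaling down the informative blocks of the same modality. A learned scoring module would normally replace the fixed cosine rule used here.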

Paper Structure

This paper contains 34 sections, 19 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Distribution of noise. Lighter regions indicate more noise and less useful information.
  • Figure 2: An illustration of the MoLAN framework and the MoLAN+ method. The purple box at the top shows the MoLAN framework, and the lower part shows the entire process of the MoLAN+ method. The framework view at the top gives a detailed description of the MoLAN block that appears in the MoLAN+ method.
  • Figure 3: Pixel-level heatmap. Color intensity indicates the magnitude of the pixel value, with brighter areas representing stronger image information.
  • Figure 4: Performance comparison under different similarity thresholds $\theta$. The four curves represent experimental results on CMU-MOSI, CMU-MOSEI, CH-SIMS, and IEMOCAP datasets, respectively. The x-axis denotes the similarity threshold $\theta$, and the y-axis indicates the model performance.
  • Figure 5: Case Study. The blue area in the visual modality represents the target person. The red, green, and blue colors in the audio modality represent the attention distribution of MMML, t-HNE, and MoLAN+ on different audio segments, respectively. The red boxes in the audio mark the noisy segments. The heatmap shows the model's attention strength in different visual regions.