HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis

Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu

TL;DR

This work addresses privacy risks in multimodal sentiment analysis (MSA) by treating privacy requirements separately for each modality and introducing HyDiscGAN, a hybrid distributed cross-modality cGAN. The method pretrains a cross-modality generator on the server to produce fake audio and visual features conditioned on shareable text, then trains the MSA component with the discriminators frozen to preserve privacy during inference. Training proceeds in two stages, modal alignment followed by MSA optimization, which yields strong sentiment performance while reducing client-side computational burden and enabling privacy-preserving testing. Empirical results on MOSI and MOSEI show that HyDiscGAN is competitive with state-of-the-art models, with notable advantages in privacy-preserving distributed settings and clear gains from the customized contrastive losses and feature-generation strategy. Overall, the approach offers a scalable path to secure, efficient, and effective multimodal sentiment analysis in real-world distributed environments.

Abstract

Multimodal Sentiment Analysis (MSA) aims to identify speakers' sentiment tendencies in multimodal video content, which raises serious concerns about the privacy risks associated with multimodal data such as voiceprints and facial images. Distributed collaborative learning has recently been verified as an effective paradigm for privacy preservation in multimodal tasks. However, existing approaches often overlook the privacy distinctions among different modalities and struggle to strike a balance between performance and privacy preservation. This raises the question of how to maximize multimodal utilization to improve performance while simultaneously protecting the modalities that require protection. This paper is the first attempt at modality-specified (i.e., audio and visual) privacy preservation in MSA tasks. We propose a novel Hybrid Distributed cross-modality cGAN framework (HyDiscGAN), which learns multimodality alignment to generate fake audio and visual features conditioned on shareable de-identified textual data. The objective is to use the fake features to approximate real audio and visual content, guaranteeing privacy preservation while effectively enhancing performance. Extensive experiments show that, compared with the state-of-the-art MSA model, HyDiscGAN achieves superior or competitive performance while preserving privacy.
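
The data flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the feature sizes, the single random linear layers standing in for trained networks, the concatenation-based fusion, and every name (`CrossModalityGenerator`, `predict_sentiment`, `make_linear`, `apply_linear`) are assumptions chosen for illustration. The only point it shows is that, at inference time, sentiment is predicted from the text features plus audio and visual features generated from that text, so no real audio or visual data is needed.

```python
import random

# Hypothetical feature sizes, chosen only for this illustration.
TEXT_DIM, AUDIO_DIM, VISUAL_DIM = 8, 4, 6


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained network layer."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class CrossModalityGenerator:
    """Maps shareable text features to fake audio and visual features."""

    def __init__(self, rng):
        self.to_audio = make_linear(TEXT_DIM, AUDIO_DIM, rng)
        self.to_visual = make_linear(TEXT_DIM, VISUAL_DIM, rng)

    def generate(self, text_feat):
        fake_audio = apply_linear(self.to_audio, text_feat)
        fake_visual = apply_linear(self.to_visual, text_feat)
        return fake_audio, fake_visual


def predict_sentiment(text_feat, fake_audio, fake_visual, head):
    """Fuse real text with fake audio/visual features and regress a score.

    Plain concatenation is an assumption made for brevity; it is not the
    fusion module used in the paper.
    """
    fused = text_feat + fake_audio + fake_visual  # list concatenation
    return apply_linear(head, fused)[0]


rng = random.Random(0)
generator = CrossModalityGenerator(rng)
head = make_linear(TEXT_DIM + AUDIO_DIM + VISUAL_DIM, 1, rng)

# Only the text features are required as input on this path.
text_feat = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
fake_audio, fake_visual = generator.generate(text_feat)
score = predict_sentiment(text_feat, fake_audio, fake_visual, head)
```

In this sketch the weights are random, so the resulting `score` carries no meaning; in the described method the generator would first be trained adversarially to align with real audio and visual features before the sentiment component is optimized.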

Paper Structure

This paper contains 41 sections, 10 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of privacy preservation and MSA frameworks.
  • Figure 2: The overall process of HyDiscGAN, which generates "sufficiently realistic" fake features for private modalities (audio and visual) through hybrid distributed collaborative learning (DCL) between the server and clients and subsequently fuses them with the real features of the shareable modality (text) for MSA tasks.
  • Figure 3: Convergence visualization of training the cross-modality cGAN in HyDiscGAN on MOSI and MOSEI datasets, respectively.
  • Figure 4: Visualization of the gated attention weights in the Fusion Module for visual-audio features on the test set of MOSI. Brighter regions imply higher unimodal information flow through the gates.
  • Figure 5: t-SNE visualization of private modality real/fake $\texttt{<CLS>}$ tag features (i.e., $x^{*}_{\texttt{<CLS>}}$ and $z^{*}_{\texttt{<CLS>}}$) for all test samples on the MOSI dataset, with samples from different clients labeled in different colors. “Epoch=0, 30, and 100” denote the fake features at different epochs during the training of the Cross-Modality cGAN.
  • ...and 2 more figures