HyDiscGAN: A Hybrid Distributed cGAN for Audio-Visual Privacy Preservation in Multimodal Sentiment Analysis
Zhuojia Wu, Qi Zhang, Duoqian Miao, Kun Yi, Wei Fan, Liang Hu
TL;DR
This work addresses privacy risks in multimodal sentiment analysis (MSA) by distinguishing the privacy requirements of each modality and introducing HyDiscGAN, a hybrid distributed cross-modality cGAN. The method pretrains a cross-modality generator on the server to produce fake audio and visual features conditioned on shareable text, then trains the MSA component with the discriminators frozen to preserve privacy during inference. This two-stage training (modal alignment followed by MSA optimization) yields strong sentiment performance while reducing client-side computational burden and enabling privacy-preserving testing. Empirical results on MOSI and MOSEI show that HyDiscGAN is competitive with state-of-the-art models, with notable advantages in privacy-preserving distributed settings and clear gains from the customized contrastive losses and feature-generation strategy. Overall, the approach offers a scalable path to secure, efficient, and effective multimodal sentiment analysis in real-world distributed environments.
Abstract
Multimodal Sentiment Analysis (MSA) aims to identify speakers' sentiment tendencies in multimodal video content, which raises serious concerns about the privacy risks associated with multimodal data such as voiceprints and facial images. Distributed collaborative learning has recently been verified as an effective paradigm for privacy preservation in multimodal tasks. However, existing approaches often overlook the privacy distinctions among modalities and struggle to balance performance with privacy preservation. This raises an intriguing question: how can multimodal utilization be maximized to improve performance while the modalities that need protection remain protected? This paper is the first attempt at modality-specified (i.e., audio and visual) privacy preservation in MSA tasks. We propose a novel Hybrid Distributed cross-modality cGAN framework (HyDiscGAN), which learns multimodality alignment to generate fake audio and visual features conditioned on shareable de-identified textual data. The objective is to use the fake features to approximate real audio and visual content, guaranteeing privacy preservation while effectively enhancing performance. Extensive experiments show that, compared with state-of-the-art MSA models, HyDiscGAN achieves superior or competitive performance while preserving privacy.
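To make the inference-time data flow concrete, the following is a minimal sketch, not the authors' implementation. It assumes the trained generator and the MSA head can be stood in for by fixed random linear maps, and that the head fuses text with the generated features by concatenation. The class names (`CrossModalGenerator`, `SentimentHead`), the feature dimensions, and the fusion scheme are all hypothetical; the discriminators, the contrastive losses, and the two training stages themselves are omitted. The only point illustrated is that prediction consumes text features alone, with the audio and visual inputs replaced by text-conditioned fake features.

```python
import random


def linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix; a stand-in for a trained layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class CrossModalGenerator:
    """Hypothetical stand-in for the text-conditioned generator."""

    def __init__(self, text_dim, audio_dim, visual_dim, seed=0):
        rng = random.Random(seed)
        self.to_audio = linear(text_dim, audio_dim, rng)
        self.to_visual = linear(text_dim, visual_dim, rng)

    def generate(self, text_feat):
        # Fake audio and visual features depend on the text features only.
        return apply(self.to_audio, text_feat), apply(self.to_visual, text_feat)


class SentimentHead:
    """Hypothetical fusion-and-regression head (assumed: simple concatenation)."""

    def __init__(self, fused_dim, seed=1):
        self.w = linear(fused_dim, 1, random.Random(seed))

    def predict(self, text_feat, fake_audio, fake_visual):
        fused = text_feat + fake_audio + fake_visual  # list concatenation
        return apply(self.w, fused)[0]


def infer(text_feat, generator, head):
    """Inference path that never touches real audio or visual data."""
    fake_audio, fake_visual = generator.generate(text_feat)
    return head.predict(text_feat, fake_audio, fake_visual)


# Illustrative dimensions only; they are not taken from the paper.
TEXT_DIM, AUDIO_DIM, VISUAL_DIM = 8, 4, 6
gen = CrossModalGenerator(TEXT_DIM, AUDIO_DIM, VISUAL_DIM)
head = SentimentHead(TEXT_DIM + AUDIO_DIM + VISUAL_DIM)
text_feat = [0.5] * TEXT_DIM
score = infer(text_feat, gen, head)
```

In this sketch `infer` takes no audio or visual argument at all, which mirrors the privacy-preserving testing described above: only the shareable text representation is needed at prediction time.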
