Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining

Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Shitong Shao, Zhiqiang Wang

TL;DR

This work proposes USMID, a textual unimodal speaker-level membership inference detector that queries the target model with text alone, and shows that it outperforms baseline methods while using only text data.

Abstract

Audio can disclose personally identifiable information (PII), particularly when combined with related text data. It is therefore essential to develop tools that detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing membership inference attacks (MIAs) need audio as input, which risks exposing voiceprints, and they require costly shadow models. We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models but still requires both audio and text of the individual as input. To address these limitations, we then propose USMID, a textual unimodal speaker-level membership inference detector that queries the target model using only text data. We randomly generate textual gibberish that is clearly not in the training dataset. We then extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is fed into the anomaly detectors to determine whether the speaker is in the training set (anomalous) or not (normal). If available, real audio of the tested speaker can be integrated to further enhance detection. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.
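
The text-only pipeline described in the abstract (generate gibberish, embed it, fit anomaly detectors, then test new text) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_text_encoder` is a hypothetical stand-in for the CLAP text encoder and carries no real membership signal, and `CentroidDetector` is a single, deliberately simple detector rather than the set of anomaly detectors used in the paper. The sketch only shows how the pieces connect.

```python
import math
import random
import string


def make_gibberish(n_texts, length=12, seed=0):
    """Random lowercase strings that are very unlikely to occur in a training set."""
    rng = random.Random(seed)
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            for _ in range(n_texts)]


def toy_text_encoder(text, dim=8):
    """HYPOTHETICAL stand-in for the CLAP text encoder (assumption, not the real model).

    Maps text to L2-normalized letter-bucket counts so the example runs offline.
    """
    vec = [0.0] * dim
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[(ord(ch) - ord("a")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class CentroidDetector:
    """Assumed anomaly detector: distance to the gibberish centroid, cut at a quantile."""

    def fit(self, feats, quantile=0.95):
        dim = len(feats[0])
        self.center = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        dists = sorted(self._dist(f) for f in feats)
        self.threshold = dists[min(len(dists) - 1, int(quantile * len(dists)))]
        return self

    def _dist(self, feat):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, self.center)))

    def is_anomalous(self, feat):
        # Anomalous = far from the region occupied by known non-member gibberish.
        return self._dist(feat) > self.threshold


# Fit on gibberish features only; no audio and no shadow models are involved.
gibberish = make_gibberish(200)
gibberish_feats = [toy_text_encoder(t) for t in gibberish]
detector = CentroidDetector().fit(gibberish_feats)


def looks_like_member(text):
    """Flag a test text as 'member' when its feature vector is anomalous."""
    return detector.is_anomalous(toy_text_encoder(text))
```

With a real target model, the toy encoder would be replaced by the model's text encoder, and the anomaly signal would then come from how the model treats text it has seen during training rather than from the surface form of the string.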

Paper Structure

This paper contains 13 sections, 11 figures, 5 tables, 1 algorithm.

Figures (11)

  • Figure 1: Current MIAs on MCL always query with dual-modal data of the tested individual for inference, which our approach aims to avoid.
  • Figure 2: Optimization of audio is guided by a CLAP model trained on the LibriSpeech dataset, where each person has 50 audio clips. The distance between the embeddings of the optimized audio and the tested text, and the probability score of the tested text among gibberish, clearly distinguish samples inside and outside the training set of the target CLAP model.
  • Figure 3: To determine whether a person's text is in the training set, we input their audio alongside a collection of other individuals' audio clips into the CLAP model. The model then generates a probability distribution based on the matching scores, which we use to conduct inference.
  • Figure 4: To determine whether a person's audio is in the training set, we input their text alongside a collection of texts from other individuals.
  • Figure 5: Overview of USMID.
  • ...and 6 more figures
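
The probability-ranking idea in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the embeddings are made-up 2-D vectors rather than CLAP outputs, cosine similarity followed by a softmax is assumed as the matching score, and `target_rank` is a hypothetical helper name. How the resulting rank or probability is turned into a membership decision is not specified in this section, so the sketch stops at computing them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_rank(text_emb, target_audio_emb, other_audio_embs):
    """Return (rank, probability) of the target's audio among all candidates.

    Rank 1 means the target's audio is the best match for the text.
    """
    candidates = [target_audio_emb] + list(other_audio_embs)
    probs = softmax([cosine(text_emb, a) for a in candidates])
    rank = 1 + sum(p > probs[0] for p in probs[1:])
    return rank, probs[0]


# Made-up embeddings (assumption): one text, one target audio, three other speakers.
text_emb = [1.0, 0.0]
others = [[0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

rank_close, prob_close = target_rank(text_emb, [0.9, 0.1], others)
rank_far, prob_far = target_rank(text_emb, [-0.5, 1.0], others)
```

In this toy setting the well-aligned audio gets the top rank and a higher probability than the poorly aligned one. The same routine covers the Figure 4 direction by swapping the roles of text and audio.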