TUNI: A Textual Unimodal Detector for Identity Inference in CLIP Models

Songze Li, Ruoxi Cheng, Xiaojun Jia

TL;DR

This work tackles privacy leakage in CLIP by performing identity inference with only textual data. It introduces TUNI, a textual unimodal detector that reframes identity inference as anomaly detection: it uses CLIP-guided image optimization to extract features for each text and trains multiple anomaly detectors on randomly generated gibberish text. Across six CLIP architectures and multiple large-scale datasets, TUNI consistently outperforms baselines that rely on image queries or shadow models, with further gains when real images are available. The approach minimizes exposure risks and offers a practical privacy-auditing tool for multimodal models. The paper also discusses defenses, limitations, and ethical considerations.

Abstract

The widespread use of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the person's name and face photo). However, submitting images risks exposing personal information to the target model, since the model may not have encountered the image before. Additionally, previous membership inference attacks (MIAs) train shadow models to mimic the behavior of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose a textual unimodal detector (TUNI) for CLIP models, a novel technique for identity inference that 1) uses only text data to query the target model and 2) eliminates the need to train shadow models. Extensive experiments across various CLIP model architectures and datasets demonstrate that TUNI outperforms baselines despite using only text data.

Paper Structure (18 sections, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 1: Current methods query LLMs with both text and image, while our goal is to conduct identity inference with only textual data.
  • Figure 2: Features of textual descriptions extracted from the optimized images guided by a CLIP model with ResNet50x4 architecture, trained on a dataset where each person has 75 images. The cosine similarity between the embeddings of the optimized image and the tested text, and the distance between the embeddings of the optimized images, clearly distinguish samples inside the training dataset of the target CLIP model from those outside it.
  • Figure 3: Overview of TUNI.
  • Figure 4: Samples from the dataset for training CLIP models.
  • Figure 5: Detection accuracy for different numbers of optimization iterations per epoch.
  • ...and 5 more figures
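
The pipeline summarized in the TL;DR, together with the two features named in the Figure 2 caption, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the text embedding and the embeddings of the optimized images have already been obtained from the target CLIP model and are passed in as NumPy arrays. The function names (`gibberish_name`, `text_features`), the use of L2-normalized embeddings, the averaging over images, and the single Gaussian (Mahalanobis) detector with a quantile threshold are all illustrative assumptions. The summary above says only that multiple anomaly detectors are trained on gibberish text, without naming them.

```python
import random
import string

import numpy as np


def gibberish_name(rng, min_len=4, max_len=10):
    """Random lowercase string, very unlikely to match a real identity.

    Hypothetical helper: the source only says the text is randomly generated.
    """
    n = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))


def text_features(text_emb, image_embs):
    """Two features for one text, given k >= 2 optimized-image embeddings.

    text_emb:   (d,) embedding of the queried text
    image_embs: (k, d) embeddings of images optimized toward that text
    Returns [mean cosine similarity, mean pairwise distance].
    Normalizing and averaging are assumptions of this sketch.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    mean_cos = float(np.mean(v @ t))
    # Pairwise Euclidean distances between the normalized image embeddings.
    dists = np.linalg.norm(v[:, None, :] - v[None, :, :], axis=-1)
    k = len(v)
    mean_dist = float(dists.sum() / (k * (k - 1)))  # diagonal is zero
    return np.array([mean_cos, mean_dist])


class GaussianAnomalyDetector:
    """Stand-in detector: Mahalanobis distance to the gibberish features."""

    def fit(self, X, quantile=0.95):
        self.mean_ = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.prec_ = np.linalg.inv(cov)
        # Flag anything scoring above most of the gibberish training scores.
        self.threshold_ = np.quantile(self.score(X), quantile)
        return self

    def score(self, X):
        d = np.atleast_2d(X) - self.mean_
        return np.einsum("ij,jk,ik->i", d, self.prec_, d)

    def is_anomaly(self, X):
        return self.score(X) > self.threshold_
```

In use, one would compute `text_features` for many gibberish strings, which by construction are not in the training set, fit the detector on those feature vectors, and then score the feature vector of the name under test. A name flagged as an anomaly relative to the gibberish distribution is treated as likely present in the training data.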