PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage

Wenyi Zhang, Ju Jia, Xiaojun Jia, Yihao Huang, Xinfeng Li, Cong Wu, Lina Wang

TL;DR

PATFinger tackles unauthorized multimodal dataset usage with a training-free fingerprinting framework that combines a Global Optimal Perturbation (GOP) with modality-aware adaptive prompts to capture cross-modal decision boundaries. The GOP, produced by a generator trained against discriminators, induces embedding drift across modalities, while the adaptive prompts are aligned with GOP samples to profile image-text matching behavior on a surrogate model. Textual constraints keep the prompts interpretable and transferable across models, so ownership can be verified through retrieval queries without modifying the original data. Experiments on COCO, Flickr, OpenImages, and TextCaps show that PATFinger outperforms state-of-the-art baselines by substantial margins in verification capability (ΔR), remains robust under dataset pruning and defense mechanisms, and transfers well in the black-box setting. The approach offers a practical, non-intrusive means for dataset owners to detect unauthorized usage in real-world cross-modal retrieval systems.

Abstract

Multimodal datasets can be leveraged to pre-train large-scale vision-language models by providing cross-modal semantics. Current efforts to determine dataset usage mainly address single-modal dataset ownership verification, through either intrusive or non-intrusive methods, while cross-modal approaches remain under-explored. Intrusive methods can be adapted to multimodal datasets but degrade model accuracy, while non-intrusive methods rely on label-driven decision boundaries that fail to guarantee stable behaviors for verification. To address these issues, we propose a novel prompt-adapted transferable fingerprinting scheme from a training-free perspective, called PATFinger, which incorporates the global optimal perturbation (GOP) and adaptive prompts to capture dataset-specific distribution characteristics. Our scheme uses inherent dataset attributes as fingerprints instead of compelling the model to learn triggers. The GOP is derived from the sample distribution to maximize embedding drift between different modalities. PATFinger then re-aligns the adaptive prompt with GOP samples to capture the cross-modal interactions on a carefully crafted surrogate model. This allows the dataset owner to check whether a dataset has been used by observing specific prediction behaviors linked to PATFinger during retrieval queries. Extensive experiments on various cross-modal retrieval architectures demonstrate that our scheme is effective against unauthorized multimodal dataset usage, outperforming state-of-the-art baselines by 30%.
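
The verification step described above, in which fingerprint queries are issued to a retrieval system and the owner observes whether the expected items come back, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `recall_at_k`, `recall_gap`), the toy two-dimensional embeddings, and the use of a plain recall gap as a stand-in for the ΔR score are all assumptions made for illustration, since this summary does not define ΔR precisely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_embs, gallery_embs, targets, k=1):
    """Fraction of queries whose target gallery index is ranked in the top k."""
    hits = 0
    for query, target in zip(query_embs, targets):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda i: cosine(query, gallery_embs[i]),
            reverse=True,
        )
        hits += target in ranked[:k]
    return hits / len(targets)


def recall_gap(suspect_recall, reference_recall):
    """Hypothetical stand-in for a verification score: a large positive gap
    suggests the suspect model reacts to the fingerprints, while a model
    that never saw the dataset should not."""
    return suspect_recall - reference_recall


# Toy data (assumed): three gallery items and two fingerprint queries.
# For brevity the gallery embeddings are shared; in practice each model
# would embed both the queries and the gallery itself.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0, 1]  # gallery index each fingerprint query is expected to retrieve

suspect_queries = [[0.9, 0.1], [0.1, 0.9]]    # embeddings from the suspect model
reference_queries = [[0.1, 0.9], [0.7, 0.7]]  # embeddings from an independent model

suspect_recall = recall_at_k(suspect_queries, gallery, targets, k=1)
reference_recall = recall_at_k(reference_queries, gallery, targets, k=1)
gap = recall_gap(suspect_recall, reference_recall)
print(suspect_recall, reference_recall, gap)
```

In this sketch the owner would flag the suspect model when the gap exceeds a threshold calibrated on models known not to have used the dataset; how that threshold is chosen is outside what this summary specifies.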

Paper Structure

This paper contains 21 sections, 17 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The differences between existing methods and our scheme for judging the usage of multimodal datasets.
  • Figure 2: The pipeline of PATFinger consists of three stages. (a) GOP generation reveals the global dataset distribution from intra-modal and inter-modal relationships. (b) The learnable prompt is aligned with the GOP samples, where the token network represents the textual constraint. (c) Verification evaluates whether suspicious models have been trained on the owner's dataset.
  • Figure 3: Examples of protected samples generated by different methods on COCO. The red area marks the modified part.
  • Figure 4: Robustness of PATFinger and baseline methods against dataset pruning on the IT retrieval task.
  • Figure 5: Performance of PATFinger under different settings on the IT retrieval task. (a) and (b) investigate the effect of the perturbation budget. (c) and (d) evaluate the impact of the prompt setting.