Towards Traitor Tracing in Black-and-White-Box DNN Watermarking with Tardos-based Codes

Elena Rodriguez-Lois, Fernando Perez-Gonzalez

TL;DR

This work addresses traitor tracing for DNN watermarking when only black-box access may be available. It proposes a unified black-and-white-box framework that combines $q$-ary Tardos codes for black-box fingerprinting with orthogonal codes for white-box fingerprinting. The black-box component uses a secret bias drawn from a Dirichlet distribution $F(\boldsymbol{p})$ and an accusation mechanism based on the sequential probability ratio test (SPRT), with scores derived from $U_1(p)$ and $U_0(p)$. The white-box component embeds orthogonal fingerprints through a regularized loss that uses a projection $\textbf{D}$ and a basis $\textbf{S}$ to produce identifiable projections $r_j$. Empirical validation on MNIST shows that colluders can be identified with substantially fewer queries when trigger sets are shared, that a higher $\kappa$ can improve performance under Marking Assumption (MA) violations, and that main-task accuracy remains largely unaffected. The results suggest practical potential for catch-one traitor tracing before full model access is granted, but they also highlight limitations related to MA violations and the need for broader evaluation across architectures and attack types.
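To make the black-box Tardos component more concrete, here is a minimal NumPy sketch that generates a $q$-ary Tardos code and computes accusation scores. It rests on assumptions this summary does not confirm: each code position gets a secret bias vector $\boldsymbol{p}$ drawn from a symmetric Dirichlet distribution with concentration $\kappa$ (standing in for $F(\boldsymbol{p})$), the alphabet size $q$ is taken as the number of MNIST classes, and $U_1(p)=\sqrt{(1-p)/p}$ and $U_0(p)=-\sqrt{p/(1-p)}$ are the symmetric score functions known from the $q$-ary Tardos literature. The function names and the interleaving attack in the demo are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def generate_qary_tardos(n_users, m, q, kappa, rng):
    """Draw a q-ary Tardos code matrix (illustrative sketch).

    Assumption: position i has a secret bias vector p_i ~ Dirichlet(kappa, ..., kappa)
    over the q symbols, and every user's symbol at i is drawn independently from p_i.
    """
    p = rng.dirichlet(np.full(q, kappa), size=m)  # (m, q) secret biases
    codes = np.empty((n_users, m), dtype=np.int64)
    for i in range(m):
        codes[:, i] = rng.choice(q, size=n_users, p=p[i])
    return codes, p


def U1(p):
    # Score contribution when the user's symbol matches the observed symbol
    return np.sqrt((1.0 - p) / p)


def U0(p):
    # Score contribution when the user's symbol differs from the observed symbol
    return -np.sqrt(p / (1.0 - p))


def tardos_scores(codes, p, y, eps=1e-12):
    """Accumulated symmetric q-ary Tardos score of every user for observed symbols y."""
    m = codes.shape[1]
    # Bias of the symbol actually observed at each position, kept away from 0 and 1
    p_y = np.clip(p[np.arange(m), y], eps, 1.0 - eps)
    match = codes == y[np.newaxis, :]
    per_pos = np.where(match, U1(p_y)[np.newaxis, :], U0(p_y)[np.newaxis, :])
    return per_pos.sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_users, m, q = 50, 3000, 10  # hypothetical sizes; q = 10 MNIST classes
    codes, p = generate_qary_tardos(n_users, m, q, kappa=0.5, rng=rng)
    colluders = np.array([3, 7, 11])
    # Interleaving attack: each position copies the symbol of a randomly chosen colluder
    pick = rng.integers(0, colluders.size, size=m)
    y = codes[colluders[pick], np.arange(m)]
    scores = tardos_scores(codes, p, y)
    print("highest-scoring user:", int(np.argmax(scores)))
```

With these symmetric score functions, an innocent user's per-position score has zero mean and unit variance for any bias $p$, which is one reason this choice is convenient when setting accusation thresholds.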

Abstract

The growing popularity of Deep Neural Networks, which often require computationally expensive training and access to a vast amount of data, calls for accurate authorship verification methods to deter unlawful dissemination of the models and identify the source of the leak. In DNN watermarking the owner may have access to the full network (white-box) or only be able to extract information from its output to queries (black-box), but a watermarked model may include both approaches in order to gather sufficient evidence to then gain access to the network. Although there has been limited research in white-box watermarking that considers traitor tracing, this problem is yet to be explored in the black-box scenario. In this paper, we propose a black-and-white-box watermarking method for DNN classifiers that opens the door to collusion-resistant traitor tracing in black-box, exploiting the properties of Tardos codes, and making it possible to identify the source of the leak before access to the model is granted. While experimental results show that the method can successfully identify traitors, even when further attacks have been performed, we also discuss its limitations and open problems for traitor tracing in black-box.
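Identifying the source of the leak before access to the model is granted means accumulating black-box evidence query by query and stopping as soon as it suffices. The TL;DR mentions an SPRT-based accusation. As a hedged illustration only, the sketch below runs Wald's sequential probability ratio test in parallel for every user under a Gaussian approximation that is an assumption of this sketch, not the paper's model: per-query score increments are treated as $\mathcal{N}(0,1)$ for innocent users and $\mathcal{N}(\mu,1)$ for colluders, with $\mu$ a hypothetical design parameter. The function `sprt_catch_one` and its arguments are illustrative names.

```python
import math

import numpy as np


def sprt_catch_one(increments, mu, alpha, beta):
    """Catch-one accusation: Wald's SPRT run in parallel for every user.

    Hypothetical model (an assumption, not the paper's): each user's per-query
    score increment is N(0, 1) if innocent (H0) and N(mu, 1) if a colluder (H1).
    `increments` yields one array of per-user increments per query.
    Returns (t, user) for the first accusation, or None if the queries run out.
    """
    upper = math.log((1.0 - beta) / alpha)  # accuse (accept H1) at or above this
    lower = math.log(beta / (1.0 - alpha))  # clear (accept H0) at or below this
    llr = None
    active = None
    for t, x in enumerate(increments, start=1):
        x = np.asarray(x, dtype=float)
        if llr is None:
            llr = np.zeros_like(x)
            active = np.ones(x.shape, dtype=bool)
        # Gaussian log-likelihood-ratio increment: log N(x; mu, 1) - log N(x; 0, 1)
        llr[active] += mu * x[active] - 0.5 * mu * mu
        accused = np.flatnonzero(active & (llr >= upper))
        if accused.size > 0:
            # Stop at the first crossing; if several users cross together, pick the largest LLR
            return t, int(accused[np.argmax(llr[accused])])
        # Users whose LLR falls to the lower threshold are cleared and no longer tested
        active &= llr > lower
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    inc = rng.standard_normal((1000, 20))  # 1000 queries, 20 users
    inc[:, 5] += 1.0  # user 5 plays the colluder role in this toy simulation
    print(sprt_catch_one(inc, mu=1.0, alpha=1e-6, beta=1e-6))
```

Because every user is tested at once, the overall probability of a false accusation grows roughly with the number of users, so $\alpha$ would need to be chosen per user with that in mind.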

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Impact of using Tardos codes on the number of queries needed for a single user before an accusation.
  • Figure 2: Experimental distribution of $t^*$ according to $\kappa$.
  • Figure 3: Evolution of the main task accuracy on $\mathcal{T}$.
  • Figure 4: Experimental distribution of $t^*$ according to $\mathcal{T}$.
  • Figure 5: Experimental distribution of $r_j$.