Towards Traitor Tracing in Black-and-White-Box DNN Watermarking with Tardos-based Codes

Elena Rodriguez-Lois; Fernando Perez-Gonzalez

TL;DR

This work addresses traitor tracing for DNN watermarking when only black-box access may be available, proposing a unified black-and-white-box framework that combines $q$-ary Tardos codes for black-box fingerprinting with orthogonal codes for white-box fingerprinting. The black-box component uses a secret Dirichlet bias $F(\boldsymbol{p})$ and a SPRT-based accusation mechanism with scores derived from $U_1(p)$ and $U_0(p)$, while the white-box component embeds orthogonal fingerprints via a regularized loss that leverages a projection $\textbf{D}$ and basis $\textbf{S}$ to produce identifiable projections $r_j$. Empirical validation on MNIST demonstrates that traitor tracing can identify colluders with substantially fewer queries when trigger sets are shared, that higher $\kappa$ can improve performance under Marking Assumption violations, and that the main task accuracy remains largely unaffected. The results reveal practical potential for catch-one traitor tracing before granting full model access, but also highlight limitations related to MA violations and the need for broader evaluation across architectures and attack types.
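To make the black-box idea concrete, the following is a minimal sketch of $q$-ary Tardos code generation with a Dirichlet bias, scored with the standard symmetric Tardos score for arbitrary alphabets. It is not the paper's implementation: the function names, the clamp `eps`, and the score functions are assumptions chosen for illustration, the paper's $U_1(p)$ and $U_0(p)$ may be defined differently, and the SPRT-based accusation step is not modeled.

```python
import math
import random

def sample_bias(q, kappa, rng):
    """Draw one bias vector p ~ Dirichlet(kappa, ..., kappa) over q symbols."""
    g = [rng.gammavariate(kappa, 1.0) for _ in range(q)]
    total = sum(g)
    return [x / total for x in g]

def generate_code(n_users, length, q, kappa, seed=0):
    """Return (biases, code): one secret bias vector per position and
    one codeword per user, each symbol drawn from that position's bias."""
    rng = random.Random(seed)
    biases = [sample_bias(q, kappa, rng) for _ in range(length)]
    code = [
        [rng.choices(range(q), weights=p)[0] for p in biases]
        for _ in range(n_users)
    ]
    return biases, code

def symmetric_score(codeword, observed, biases, eps=1e-6):
    """Symmetric Tardos score of one user against the observed sequence.
    A match on symbol y adds sqrt((1-p_y)/p_y); a mismatch subtracts
    sqrt(p_y/(1-p_y)). The clamp eps is an assumption for numerical safety."""
    score = 0.0
    for x, y, p in zip(codeword, observed, biases):
        py = min(max(p[y], eps), 1.0 - eps)
        if x == y:
            score += math.sqrt((1.0 - py) / py)
        else:
            score -= math.sqrt(py / (1.0 - py))
    return score

if __name__ == "__main__":
    biases, code = generate_code(n_users=5, length=200, q=4, kappa=0.5, seed=1)
    leaked = code[0]  # a single "traitor" who leaks an unmodified codeword
    for j, cw in enumerate(code):
        print(j, round(symmetric_score(cw, leaked, biases), 2))
```

In this toy run the leaking user matches the observed sequence at every position, so their score is a sum of strictly positive terms, while the scores of the other users are centered around zero in expectation. In the black-box setting described above, the observed symbols would come from the suspect model's answers to trigger-set queries rather than from a codeword read directly.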

Abstract

The growing popularity of Deep Neural Networks, which often require computationally expensive training and access to a vast amount of data, calls for accurate authorship verification methods to deter unlawful dissemination of the models and identify the source of the leak. In DNN watermarking, the owner may have access to the full network (white-box) or only be able to extract information from its output to queries (black-box), but a watermarked model may include both approaches so that the owner can first gather sufficient black-box evidence and then gain access to the network. Although there has been limited research in white-box watermarking that considers traitor tracing, this problem is yet to be explored in the black-box scenario. In this paper, we propose a black-and-white-box watermarking method for DNN classifiers that opens the door to collusion-resistant traitor tracing in the black-box setting, exploiting the properties of Tardos codes, and making it possible to identify the source of the leak before access to the model is granted. While experimental results show that the method can successfully identify traitors, even when further attacks have been performed, we also discuss its limitations and open problems for traitor tracing in the black-box setting.

Paper Structure (17 sections, 9 equations, 5 figures, 2 tables)

This paper contains 17 sections, 9 equations, 5 figures, 2 tables.

Introduction and Previous Works
Proposed Method
Black-Box Fingerprinting with $q$-ary Tardos Codes
White-Box Fingerprinting with Orthogonal Codes
Implementation and Experimental Results
DNN Architecture and Main Task
Choice of trigger set
Watermarking Parameters and Training Process
User Attacks on the Individual Watermarks
Experimental Results
Impact of Tardos Codes
Influence of $\kappa$
Influence of $\mathcal{T}$
Evaluation of the False Negative Rate
Simultaneous White-Box Fingerprinting
...and 2 more sections

Figures (5)

Figure 1: Impact of using Tardos codes on the number of queries needed for a single user before an accusation.
Figure 2: Experimental distribution of $t^*$ according to $\kappa$.
Figure 3: Evolution of the main task accuracy on $\mathcal{T}$.
Figure 4: Experimental distribution of $t^*$ according to $\mathcal{T}$.
Figure 5: Experimental distribution of $r_j$.
