Why Neural Structural Obfuscation Can't Kill White-Box Watermarks for Good!

Yanna Jiang, Guangsheng Yu, Qingyuan Yu, Yi Chen, Qin Wang

Abstract

Neural Structural Obfuscation (NSO) (USENIX Security'23) is a family of ``zero cost'' structure-editing transforms (\texttt{nso\_zero}, \texttt{nso\_clique}, \texttt{nso\_split}) that inject dummy neurons. By combining neuron permutation and parameter scaling, NSO radically modifies the network's structure and parameters while strictly preserving functional equivalence, thereby disrupting white-box watermark verification. This capability fundamentally challenges the reliability of existing white-box watermarking schemes. We rethink NSO and, for the first time, fully recover from the damage it has caused. We redefine NSO as a graph-consistent threat model within a \textit{producer--consumer} paradigm. This formulation posits that any obfuscation of a producer node necessitates a compatible layout update in all downstream consumers to maintain structural integrity. Building on these consistency constraints on signal propagation, we present \textsc{Canon}, a recovery framework that probes the attacked model to identify redundant or dummy channels and then \textit{globally} canonicalizes the network by rewriting \textit{all} downstream consumers by construction, synchronizing layouts across \texttt{fan-out}, \texttt{add}, and \texttt{cat}. Extensive experiments demonstrate that, even under strong composed and extended NSO attacks, \textsc{Canon} achieves \textbf{100\%} recovery success, restoring watermark verifiability while preserving task utility. Our code is available at https://anonymous.4open.science/r/anti-NSO-9874.
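To make the "functionally equivalent but structurally different" claim concrete, the following is a minimal NumPy sketch on a toy two-layer ReLU network. It is not the authors' or the original NSO implementation: the function bodies are assumptions built only from the constraints the abstract and figure captions name (zero outgoing weights, clique weights summing to zero, split coefficients summing to one), and the toy `forward` model and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W1, W2, x):
    """Toy two-layer ReLU network: x -> W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def nso_zero(W1, W2, k, rng):
    """Assumed form: append k dummy neurons whose outgoing weights are all zero."""
    W1_new = np.vstack([W1, rng.normal(size=(k, W1.shape[1]))])
    W2_new = np.hstack([W2, np.zeros((W2.shape[0], k))])
    return W1_new, W2_new

def nso_clique(W1, W2, k, rng):
    """Assumed form: append k dummies with identical incoming weights
    whose outgoing weights sum to zero (sum w = 0)."""
    row = rng.normal(size=(1, W1.shape[1]))
    out = rng.normal(size=(W2.shape[0], k))
    out -= out.mean(axis=1, keepdims=True)  # each output row now sums to 0
    return np.vstack([W1, np.repeat(row, k, axis=0)]), np.hstack([W2, out])

def nso_split(W1, W2, j, alphas):
    """Assumed form: replace neuron j by copies whose outgoing weights
    are scaled by alphas with sum alpha = 1."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.isclose(alphas.sum(), 1.0)
    copies = np.repeat(W1[j:j + 1], len(alphas), axis=0)
    outs = W2[:, j:j + 1] * alphas[None, :]
    W1_new = np.vstack([np.delete(W1, j, axis=0), copies])
    W2_new = np.hstack([np.delete(W2, j, axis=1), outs])
    return W1_new, W2_new

# Clean toy model with 4 hidden neurons.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=(3, 16))

# Compose the three edits, then permute and positively rescale hidden neurons.
A1, A2 = nso_zero(W1, W2, 2, rng)
A1, A2 = nso_clique(A1, A2, 3, rng)
A1, A2 = nso_split(A1, A2, 0, [0.7, 0.5, -0.2])
perm = rng.permutation(A1.shape[0])
A1, A2 = A1[perm], A2[:, perm]
scale = rng.uniform(0.5, 2.0, size=A1.shape[0])  # relu(c*z) = c*relu(z) for c > 0
A1, A2 = A1 * scale[:, None], A2 / scale[None, :]

# Largest output difference between the clean and the obfuscated model.
max_gap = float(np.abs(forward(W1, W2, x) - forward(A1, A2, x)).max())
```

In this sketch the hidden layer grows from 4 to 11 neurons and every weight changes, yet `max_gap` stays at floating-point noise, which is why a verifier that reads weights at fixed positions no longer finds the watermark.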

Paper Structure (44 sections, 23 equations, 6 figures, 13 tables, 2 algorithms)


Figures (6)

  • Figure 1: Canon can recover a strengthened NSO variant that generalizes the original layer-local edits into graph-aware structural transformations that remain consistent across branches and merges, enabling attacks to be injected globally rather than only in isolated layers. The three NSO attacks correspond to different constrained forms of $M$ (zero injection, clique cancellation $\sum w=0$, and split preservation $\sum \alpha=1$).
  • Figure 2: Neuron-view global synchronization. An adversary performs channel layout obfuscation by introducing a linear map on activations, writing $y_{\text{before}} = M_{\text{inj}}\, y_{\text{after}}$ (with $M_{\text{inj}}$ typically sparse/structured and possibly non-square), and injecting redundant dummy channels (gray) via primitives such as nso_split and nso_clique while preserving end-to-end input–output behavior. (i) For residual add (ResNet), we infer a channel transform $M_A$ on one branch from probe activations. Since element-wise add requires layout compatibility, we enforce layout synchronization by applying a compatible rewrite on the sibling edge, even if it is attack-free. The unified add-output layout is then propagated by rewriting all downstream linear consumers (fan-out) via $W \leftarrow W\, M_A$, without requiring local rediscovery of redundancy. (ii) For channel concatenation (Inception), each branch is compacted independently with transforms $(M_X, M_Y)$. At cat, these compose into a block-diagonal transform $\mathrm{blkdiag}(M_X, M_Y)$, which is propagated to all downstream consumers via $W \leftarrow W\times\mathrm{blkdiag}(M_X, M_Y)$.
  • Figure 3: Recovery. By clustering activation signatures from probe inputs, we infer a compacting transform $M_e$. Rather than local pruning, our graph-consistent policy propagates $M_e$ to all downstream linear consumers via $W_{\text{new}} = W_{\text{old}}\, M_e$, enforcing consistent channel layouts across residual add and enabling block-diagonal handling of cat. This restores watermark verification on the recovered parameters $\hat{\theta}$.
  • Figure 4: Canon recovery time on ResNet-18 versus the number of probe inputs under different probe batch sizes and attack ratios. Each plot includes the clean baseline and the attacked cases (nso_zero, nso_split, nso_clique, mix-opseq, mix-opseq (per-merge-group)). Increasing the probe number generally increases runtime due to additional activation collection and clustering, while higher injection ratios amplify the cost of redundancy detection and graph-consistent consumer rewrites. The legend is shown only in the first subplot, as it applies to all subplots.
  • Figure 5: Canon recovery time under five NSO attack variants for ResNet-18 and DenseNet across multiple white-box watermarking schemes. Overall, recovery cost is attack-dependent and consistently highest for the composed mix-opseq setting, with a pronounced increase at $\rho=0.5$, while simpler attacks (nso_zero / nso_split / nso_clique) remain comparatively inexpensive and stable across watermark methods.
  • ...and 1 more figures
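
The recovery idea in the Figure 2 and Figure 3 captions (cluster probe-activation signatures, infer a compacting transform $M_e$, then rewrite every downstream consumer via $W_{\text{new}} = W_{\text{old}}\, M_e$) can be sketched on a toy fan-out graph. This is only an illustration under stated assumptions, not the Canon implementation: the helper names `infer_compaction`, `canonicalize`, and `blkdiag` are hypothetical, the attack is a hand-built split plus one zero-output dummy, duplicates are assumed to be exact (no per-channel rescaling), and channels whose merged outgoing weights vanish in all consumers are treated as dummies.

```python
import numpy as np

def infer_compaction(H, tol=1e-8):
    """Group channels with matching probe signatures.
    H: (channels, probes). Returns kept indices and a 0/1 membership
    matrix M of shape (channels, kept)."""
    reps, assign = [], []
    for j in range(H.shape[0]):
        for c, r in enumerate(reps):
            if np.allclose(H[j], H[r], rtol=0.0, atol=tol):
                assign.append(c)
                break
        else:  # no existing representative matched: open a new cluster
            reps.append(j)
            assign.append(len(reps) - 1)
    M = np.zeros((H.shape[0], len(reps)))
    M[np.arange(H.shape[0]), assign] = 1.0
    return np.array(reps), M

def canonicalize(W_prod, consumers, X, tol=1e-8):
    """Compact a ReLU producer and rewrite ALL of its linear consumers."""
    H = np.maximum(W_prod @ X, 0.0)            # probe activations
    reps, M = infer_compaction(H, tol)
    merged = [W @ M for W in consumers]        # W_new = W_old M_e
    # A merged channel is kept if at least one consumer still uses it.
    live = np.any([np.abs(W).max(axis=0) > tol for W in merged], axis=0)
    return W_prod[reps][live], [W[:, live] for W in merged]

def blkdiag(A, B):
    """Block-diagonal composition used at channel concatenation."""
    return np.block([[A, np.zeros((A.shape[0], B.shape[1]))],
                     [np.zeros((B.shape[0], A.shape[1])), B]])

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))                   # clean producer, 4 channels
Wa, Wb = rng.normal(size=(2, 4)), rng.normal(size=(3, 4))  # two consumers
X = rng.normal(size=(3, 64))                   # probe inputs

# Hypothetical attack: split channel 0 into three copies (coefficients sum
# to 1 per consumer), add one zero-output dummy, then permute the layout.
al_a, al_b = np.array([0.5, 0.3, 0.2]), np.array([1.5, -0.25, -0.25])
A1 = np.vstack([W1[1:], np.repeat(W1[:1], 3, axis=0), rng.normal(size=(1, 3))])
Aa = np.hstack([Wa[:, 1:], Wa[:, :1] * al_a, np.zeros((2, 1))])
Ab = np.hstack([Wb[:, 1:], Wb[:, :1] * al_b, np.zeros((3, 1))])
perm = rng.permutation(A1.shape[0])
A1, Aa, Ab = A1[perm], Aa[:, perm], Ab[:, perm]

R1, (Ra, Rb) = canonicalize(A1, [Aa, Ab], X)
gap = float(np.abs(Ra @ np.maximum(R1 @ X, 0.0)
                   - Wa @ np.maximum(W1 @ X, 0.0)).max())

# cat: per-branch transforms compose block-diagonally for the consumer.
M_X = np.array([[1.0], [1.0]])                 # branch X: 2 duplicates -> 1
M_Y = np.eye(2)                                # branch Y: attack-free
W_cat = rng.normal(size=(2, 4))
W_cat_new = W_cat @ blkdiag(M_X, M_Y)          # shape (2, 3)
```

Here the producer is compacted once and the same transform is applied to both consumers, so neither has to rediscover the redundancy locally. The recovered channels still appear in the attacked model's order; this sketch does not attempt to realign that order with the original layout.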