Transpose Attack: Stealing Datasets with Bidirectional Training

Guy Amit; Mosh Levy; Yisroel Mirsky

TL;DR

The paper identifies a vulnerability in deep neural networks: a single set of weights can serve an overt forward task while covertly memorizing data in the backward direction. In this transpose attack, the weight matrices are transposed and the layer order is reversed to form a backward model. The paper also introduces a spatial indexing mechanism that allows systematic retrieval of memorized samples, and shows that architectures ranging from fully connected (FC) networks to CNNs and ViTs can memorize and exfiltrate tens of thousands of samples. The authors formalize the attack, provide a training protocol with shared weights, and propose automated detection and practical countermeasures, including gradient-honeypot-style detection and code-audit recommendations. The work highlights significant privacy and IP risks in protected environments (e.g., FL, DTaaS) and outlines directions for defenses and further research on covert dual-task neural systems.

Abstract

Deep neural networks are normally executed in the forward direction. However, in this work, we identify a vulnerability that enables models to be trained in both directions and on different tasks. Adversaries can exploit this capability to hide rogue models within seemingly legitimate models. In addition, in this work we show that neural networks can be taught to systematically memorize and retrieve specific samples from datasets. Together, these findings expose a novel method in which adversaries can exfiltrate datasets from protected learning environments under the guise of legitimate models. We focus on the data exfiltration attack and show that modern architectures can be used to secretly exfiltrate tens of thousands of samples with high fidelity, high enough to compromise data privacy and even train new models. Moreover, to mitigate this threat we propose a novel approach for detecting infected models.
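As a rough illustration of what running a model "in both directions" can mean, the following is a minimal sketch for a small fully connected network. It assumes plain weight matrices without biases and ReLU activations between layers; the helper names (`forward`, `backward`, `matvec`) are hypothetical and not taken from the paper. The backward pass reuses the same weights, transposed and applied in reverse layer order.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    """Return the transpose of matrix W."""
    return [list(col) for col in zip(*W)]

def relu(v):
    return [max(0.0, x) for x in v]

def forward(weights, x):
    """Overt direction: apply W_0, W_1, ... in order."""
    h = x
    for k, W in enumerate(weights):
        h = matvec(W, h)
        if k < len(weights) - 1:  # no activation after the last layer (assumption)
            h = relu(h)
    return h

def backward(weights, e):
    """Covert direction: apply W_{m-1}^T, W_{m-2}^T, ... to the same weights."""
    h = e
    for k, W in enumerate(reversed(weights)):
        h = matvec(transpose(W), h)
        if k < len(weights) - 1:
            h = relu(h)
    return h
```

With layer shapes 8 → 5 → 3, `forward` maps an 8-dimensional input to 3 outputs, while `backward` maps a 3-dimensional code back to 8 values using only the exported weights. Training both directions on different objectives is outside the scope of this sketch.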

Paper Structure (37 sections, 8 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Attack Model
Transpose Attack
Background
Backward Execution
Transposing a Layer
Transposing a Model
Model Training
Data Memorization
Spatial Indexing
Memorization Training Objective
Evaluation
Experiment Setup
Image Quality (Confidentiality)
Data Reuse (IP Theft)
...and 22 more sections

Figures (11)

Figure 1: The attack model explored in this paper. An attacker trains a classification model with a hidden secondary task of memorizing a protected dataset. The model passes inspection and is exploited off site.
Figure 2: Left: an overview of the transpose models. Model $f_\theta$ is trained on the overt primary objective of $f(x)=y$, where $\theta= \{\theta_0,\theta_1,...\}$. In parallel, a transpose model $f'$ is trained on the secondary covert task of $f'_{\theta^T}(e)=z$, where $\theta^T= \{\theta_{m-1}^T,\theta_{m-2}^T,...\}$. The weights $\theta$ and $\theta^T$ are shared between the models during training. Therefore, the attacker can export the seemingly benign $\theta$ and then later recreate $\theta^T$ to use $f'$. Right: an example of how a CNN (VGG-19) is transposed.
Figure 3: An example of $n$-dimensional spatial index $I$ where $n=3$. Given the $i$-th item in class $c$, the spatial index is calculated by (1) finding the $n$-ary Gray code for $i$ and then (2) adding to it the one-hot encoding for $c$ (multiplied by $n$). In this example, only $n^n=27$ items can be indexed per class.
Figure 4: Samples of images retrieved using $f_{\theta^T}$ with different models. For example, "give me the 472nd image for class car" or "give me the 15th image of identity A." Left to right: When we increase the number of images that $f_{\theta^T}$ must memorize, the quality of the retrieved images degrades.
Figure 5: The trade-off between the performance on the primary and secondary tasks as a function of the number of samples memorized by $f'$. Rows: Models grouped by their datasets. Columns: The performance on the primary and secondary tasks. Pixel accuracy is measured with MSE, and feature accuracy is the accuracy of a different classifier (described in the Experiment Setup section) executed on the retrieved images. Each data point is the average result from five random runs.
...and 6 more figures
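
The spatial index described in the Figure 3 caption can be sketched as follows. This is only an illustration under several assumptions: it uses the standard modular $n$-ary Gray code (each digit is the difference between adjacent base-$n$ digits), stores digits least-significant first, and requires the class id $c$ to be smaller than $n$ so that the one-hot vector has the same length as the code. The function names are hypothetical, and the paper's exact construction may differ.

```python
def nary_gray_code(i, n):
    """Return an n-digit, base-n modular Gray code for i (least-significant digit first)."""
    if not 0 <= i < n ** n:
        raise ValueError("only n**n items can be indexed per class")
    digits = []
    for _ in range(n):
        digits.append(i % n)
        i //= n
    digits.append(0)  # sentinel above the most significant digit
    # Modular Gray code: g_k = (b_k - b_{k+1}) mod n, so that
    # consecutive values of i differ in exactly one digit.
    return [(digits[k] - digits[k + 1]) % n for k in range(n)]

def spatial_index(i, c, n):
    """Gray code for item i plus n times the one-hot encoding of class c."""
    if not 0 <= c < n:
        raise ValueError("this sketch assumes class ids in range(n)")
    code = nary_gray_code(i, n)
    code[c] += n  # one-hot for c, scaled by n
    return code
```

For $n=3$ this yields 27 distinct codes per class, matching the $n^n=27$ limit stated in the caption. Because the class offset lifts one coordinate into the range $n$ to $2n-1$, indices for different classes do not collide.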
