Towards White Box Deep Learning

Maciej Satkiewicz

TL;DR

The paper tackles the vulnerability and interpretability gap of deep neural networks by introducing semantic features: locality-sensitive invariants that regularize representations. It proposes a lightweight, four-layer proof-of-concept (PoC) white-box network built from semantic features, consisting of a Two Step Layer, a Convolutional Semantic Layer, an Affine Layer, and a Logical Layer, which yields human-aligned, interpretable features and robust behavior. On MNIST 3 vs 5, the model achieves ~92% adversarial accuracy under AutoAttack without adversarial training, ~98% adversarial precision at 80% recall, and ~99.5% clean accuracy; ablations show that the initial layers and targeted augmentation are critical to these results. The work suggests that locality-engineered semantic features can improve robustness and interpretability, and it offers a blueprint for extending semantic features to broader domains and modalities, along with limitations and directions for future research.

Abstract

Deep neural networks learn fragile "shortcut" features, rendering them difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof of concept network is lightweight, inherently interpretable and achieves almost human-level adversarial test metrics - with no adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn

Paper Structure (32 sections, 1 equation, 13 figures)

Figures (13)

  • Figure 1: Visualisation of semantic feature $f_{\mathbf{P}\mathbf{L}}$ and its matching mechanism (SFmatch, see Section \ref{sec:SFmatch}). The semantic feature $f_{\mathbf{P}\mathbf{L}}$ consists of a base feature $f$ and a set of its "small" modifications $\mathbf{L}(p_i)(f)$. In this case, the base feature has the same dimensions as the input image and its modifications are 2D affine transformations. $\mathrm{SFmatch}(d, f_{\mathbf{P}\mathbf{L}})$ takes the maximum of the scalar products $d\cdot\mathbf{L}(p_i)(f)$, thus identifying all $\mathbf{L}(p_i)(f)$. Affine transformations are typically parameterized by extended matrices of dimension $2\times3$; both the base feature and the parameters $p_i$ of its modifications are learned. If $f$ and its modifications are easily understood by humans, then the neural network layer composed of such features can be considered a white box layer. The precise definition of semantic feature is found in Section \ref{sec:theory}.
  • Figure 2: A typical DNN for MNIST can be arbitrarily fooled by adding semantically negligible noise.
  • Figure 3: Architecture of the PoC white box neural network. The first two layers consist of semantic features that operate per pixel: the first layer takes into account only the pixel's value, while the second examines its $5\times5$ neighborhood to determine whether the pixel lies on a "bright line". The first two layers retain the shape of the input image. The third layer comprises 8 affine $f_{\mathbf{P}\mathbf{L}}$ as visualized in Figure \ref{fig:SFmatch}. The final layer consists of two logical $f_{\mathbf{P}\mathbf{L}}$: intuitively, the first one checks whether at least one affine $f_{\mathbf{P}\mathbf{L}}$ corresponding to "3" is active and none of the affine $f_{\mathbf{P}\mathbf{L}}$ corresponding to "5" are active; the second logical $f_{\mathbf{P}\mathbf{L}}$ works in the opposite way. Section \ref{sec:architecture} describes the architecture in more detail.
  • Figure 4: (left) Reliability test-time curve (see Section \ref{sec:quantitative_results} for details). (right) Learning curve. Both test and validation metrics presented here are computed under the classic 40-step PGD Attack.
  • Figure 5: Two Step layer - initial and learned. The visible smooth thresholding essentially means that the layer has learned the permissible perturbations (i.e. intervals) of real semantic features corresponding to the state of being OFF (number -1) and ON (number 1).
  • ...and 8 more figures
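
The matching mechanism described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `affine_warp` and `sf_match` are hypothetical, the warp uses nearest-neighbour sampling with zero padding as a simplifying assumption, and the base feature and the $2\times3$ matrices are fixed by hand here, whereas the paper learns both. The caption only states that SFmatch takes the maximum of the scalar products, so any normalization or other details used in the paper are not reproduced.

```python
import numpy as np


def affine_warp(feature, theta):
    """Warp a 2D feature with a 2x3 affine matrix (hypothetical helper).

    theta maps centred output coordinates (x, y, 1) to centred input
    coordinates; sampling is nearest-neighbour with zero padding.
    """
    h, w = feature.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Homogeneous coordinates of every output pixel, centred on the image.
    coords = np.stack([xs.ravel() - cx, ys.ravel() - cy, np.ones(h * w)])
    src = np.asarray(theta, dtype=float) @ coords  # shape (2, h*w)
    sx = np.rint(src[0] + cx).astype(int)
    sy = np.rint(src[1] + cy).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=feature.dtype)
    out[valid] = feature[sy[valid], sx[valid]]
    return out.reshape(h, w)


def sf_match(d, base_feature, thetas):
    """Return the maximum scalar product d . L(p_i)(f) and the index i attaining it."""
    scores = [float(np.sum(d * affine_warp(base_feature, t))) for t in thetas]
    best = int(np.argmax(scores))
    return scores[best], best


# Base feature f: a vertical bar in column 3 of a 7x7 image.
f = np.zeros((7, 7))
f[:, 3] = 1.0

# Hand-picked parameters p_i: identity and two one-pixel horizontal shifts.
thetas = [
    [[1, 0, 0], [0, 1, 0]],   # identity
    [[1, 0, -1], [0, 1, 0]],  # moves the bar one pixel to the right
    [[1, 0, 1], [0, 1, 0]],   # moves the bar one pixel to the left
]

# Input d: the same bar, shifted one pixel to the right.
d = np.zeros((7, 7))
d[:, 4] = 1.0

score, index = sf_match(d, f, thetas)
print(score, index)
```

In this toy setup the input matches the right-shifted copy of the base feature rather than the base feature itself, which is the sense in which a single semantic feature covers a whole family of "small" modifications.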