Towards White Box Deep Learning
Maciej Satkiewicz
TL;DR
The paper tackles the vulnerability and interpretability gap of deep neural networks by introducing semantic features: locality-sensitive invariants that regularize representations. It proposes a lightweight, four-layer proof-of-concept white-box network built from semantic features, comprising a Two Step Layer, a Convolutional Semantic Layer, an Affine Layer, and a Logical Layer, which yields human-aligned, interpretable features and robust behavior. On MNIST 3 vs 5, the model achieves ~92% adversarial accuracy under AutoAttack without adversarial training, ~98% adversarial precision at 80% recall, and ~99.5% clean accuracy; ablations show that the initial layers and targeted augmentation are critical to these results. The work suggests that locality-engineered semantic features can improve robustness and interpretability, offers a blueprint for extending semantic features to broader domains and modalities, and outlines limitations and directions for future research.
Abstract
Deep neural networks learn fragile "shortcut" features, which makes the networks difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof-of-concept network is lightweight, inherently interpretable, and achieves almost human-level adversarial test metrics, with no adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn
