Visualizing Neural Network Imagination
Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez
TL;DR
This paper tackles interpretability by visualizing the intermediate environment states a neural network represents in its hidden activations while predicting a final environment state. It introduces an encoder–RNN–decoder architecture applied to Conway's Game of Life (GoL) and augments it with autoencoder regularization and adversarial decoder training, so that decoding the hidden representations yields GoL-like intermediate reconstructions. A thresholded pixel-matching metric assesses how well the intermediate decodings align with ground-truth GoL states. Experiments show that architectural choices and training objectives influence interpretability, with autoencoder and adversarial training generally improving it. The approach shows promise for revealing network 'imagination' in a controlled setting, but it does not yet scale to more complex domains such as chess, which marks both its potential and its current limits.
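To make the thresholded pixel-matching idea concrete, here is a minimal sketch in NumPy. It is one plausible reading of the metric, not the paper's exact definition: the function names `gol_step` and `pixel_match`, the 0.5 threshold, and the wrap-around board edges are all assumptions made for illustration.

```python
import numpy as np

def gol_step(board: np.ndarray) -> np.ndarray:
    """Advance a 2D 0/1 board by one Game of Life step.

    Assumption: edges wrap around (toroidal board); the summary above
    does not say which boundary condition is used.
    """
    # Count the eight neighbors of every cell by shifting the board.
    neighbors = sum(
        np.roll(np.roll(board, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step if it has 3 neighbors,
    # or if it is alive now and has 2 neighbors.
    alive = (neighbors == 3) | ((board == 1) & (neighbors == 2))
    return alive.astype(np.uint8)

def pixel_match(decoded: np.ndarray, target: np.ndarray,
                threshold: float = 0.5) -> float:
    """Fraction of pixels where the thresholded decoding equals the target.

    `decoded` holds values in [0, 1] (e.g. decoder probabilities);
    `target` is a ground-truth 0/1 GoL board. The threshold value is
    a hypothetical default.
    """
    binarized = (decoded >= threshold).astype(np.uint8)
    return float(np.mean(binarized == target))

# Usage: a horizontal "blinker" becomes vertical after one step.
start = np.zeros((5, 5), dtype=np.uint8)
start[2, 1:4] = 1
truth = gol_step(start)

# A stand-in for a soft decoder output that is close to the true state.
soft_decoding = np.where(truth == 1, 0.9, 0.1)
score = pixel_match(soft_decoding, truth)
```

In this sketch a score of 1.0 means that every pixel of the thresholded decoding matches the ground-truth state at that step. Comparing each intermediate decoding against the corresponding GoL rollout step would give one score per step.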
Abstract
In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture with a decoder network at the end. After training, we apply the decoder to the intermediate representations of the network to visualize what they represent. We define a quantitative interpretability metric and use it to demonstrate that hidden states can be highly interpretable on a simple task. We also develop autoencoder and adversarial techniques and show that they benefit interpretability.
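The data flow described in the abstract can be sketched as follows. This is only an illustration of applying the same decoder to every intermediate hidden state. The weights are random stand-ins for a trained model, and the single dense layers, the sizes `H`, `W`, `D`, the input-free recurrent step, and all function names are assumptions rather than the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 8, 8, 32  # board height, board width, hidden size (illustrative)

# Random stand-ins for trained weights (assumption: one dense layer each).
W_enc = rng.normal(scale=0.1, size=(H * W, D))
W_rnn = rng.normal(scale=0.1, size=(D, D))
W_dec = rng.normal(scale=0.1, size=(D, H * W))

def encode(board: np.ndarray) -> np.ndarray:
    """Map a flattened board to the initial hidden state."""
    return np.tanh(board.reshape(-1) @ W_enc)

def rnn_step(h: np.ndarray) -> np.ndarray:
    """One recurrent update; assumed here to take no per-step input."""
    return np.tanh(h @ W_rnn)

def decode(h: np.ndarray) -> np.ndarray:
    """Map a hidden state to per-pixel values in (0, 1)."""
    logits = h @ W_dec
    return (1.0 / (1.0 + np.exp(-logits))).reshape(H, W)

def forward_with_visualizations(board: np.ndarray, n_steps: int):
    """Run the model and decode every hidden state along the way."""
    h = encode(board)
    hidden_states = [h]
    for _ in range(n_steps):
        h = rnn_step(h)
        hidden_states.append(h)
    # During training only the final decoding would be supervised ...
    prediction = decode(h)
    # ... afterwards the same decoder is applied to each intermediate state.
    visualizations = [decode(s) for s in hidden_states]
    return prediction, visualizations

board = rng.integers(0, 2, size=(H, W)).astype(np.float64)
prediction, visualizations = forward_with_visualizations(board, n_steps=3)
```

With trained weights, each entry of `visualizations` would be an image that can be compared against the true environment state at the corresponding step. With the random weights used here, the outputs carry no meaning beyond their shapes.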
