Generative Kaleidoscopic Networks

Harsh Shrivastava

Abstract

We discovered that neural networks, especially deep ReLU networks, demonstrate an 'over-generalization' phenomenon: the outputs for inputs that were not seen during training are mapped close to the output range observed during learning. In other words, the networks learn a many-to-one mapping, and this effect becomes more prominent as we increase the number of layers (the depth) of the network. We utilize this property to design a dataset kaleidoscope, termed 'Generative Kaleidoscopic Networks'. Succinctly, if we learn a model that maps an input $x\in\mathbb{R}^D$ to itself, $f_\mathcal{N}(x)\rightarrow x$, the proposed 'Kaleidoscopic sampling' procedure starts with random input noise $z\in\mathbb{R}^D$ and recursively applies $f_\mathcal{N}(\cdots f_\mathcal{N}(z)\cdots )$. After a burn-in period, we start observing samples from the input distribution, and the quality of the recovered samples improves as we increase the depth of the model. Scope: We observed this phenomenon to various degrees in other deep learning architectures such as CNNs, Transformers and U-Nets, and we are currently investigating them further.
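
The recursion described in the abstract can be written out step by step. The iterate notation $x^{(k)}$ is introduced here purely for illustration, and the burn-in symbol $B$ is borrowed from the Figure 1 caption; both are assumptions about notation rather than the paper's own equations:

$$
\begin{aligned}
x^{(0)} &= z, \qquad z\in\mathbb{R}^D \ \text{(random noise)},\\
x^{(k)} &= f_\mathcal{N}\bigl(x^{(k-1)}\bigr), \qquad k = 1, 2, \ldots
\end{aligned}
$$

Under this reading, the iterates $x^{(k)}$ with $k \geq B$ are the ones treated as samples from the input distribution.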

Paper Structure (6 sections, 3 equations, 8 figures)

Figures (8)

  • Figure 1: Manifold learning & Kaleidoscopic sampling. [left] During the manifold learning process, the weights of the deep ReLU network or Multilayer Perceptron (MLP) have their gradients enabled (shaded in blue), and the output units are bounded to $(0,1)$ via a 'Sigmoid' non-linearity (or to $(-1,1)$ for 'Tanh'). The weights are learned as per Eq. \ref{eq:manifold-learning}. [right] During the sampling process, the weights of the neural network model are frozen and the input is randomly initialized noise, sampled from a normal or uniform distribution. The model $f$ is repeatedly applied to the input noise in accordance with Eq. \ref{eq:manifold-sampling}. Once the function has been applied $B$ times, $f\circ f\circ \cdots \circ f$, we start obtaining samples closer to the input data distribution, $B\rightarrow K$.
  • Figure 2: Single point in 1D space. The top row shows the over-generalization phenomenon by performing manifold learning on a single point $x=0.5$, where the range of $x$ is $(0, 1)$. An MLP with $L=2$ hidden layers and $H=5$ hidden units per layer, with 'ReLU' as the non-linearity in the middle layers and an entrywise 'Sigmoid' in the final layer, was trained until the loss on the input data point was very low ($\leq 10^{-8}$, epochs $\geq 1K$). In other words, we overfit the MLP to the input data (we observed similar loss profiles with other choices of $L$ and $H$). (a) We plot the loss function values for the entire range of the input. As expected, the value at $x=0.5$ is the lowest. (b) It is particularly interesting that the MLP outputs over the entire range of input values are close to $0.5$ (as the network matches the input $x=0.5$ to the same value at the output). The close-up on the right also shows that the neural network tends to learn a many-to-one mapping. The bottom row shows the Kaleidoscopic sampling results for the MLP model. The initial noise (10 points) is shown in red and the recovered samples are shown in green. (c-e) The model is progressively applied $1\rightarrow 3$ times and the samples obtained are close to the training input, as expected. (best viewed in color)
  • Figure 3: Multiple points in 1D space. We run manifold learning on data points $X=\{0.2, 0.8\}$ by fitting an MLP with $H=5,L=2$ (top 2 rows) and $H=5,L=7$ (bottom 2 rows); in each case, we run for epochs $\geq 1K$ to ensure that the training loss tends to $0$. Intermediate layers had a 'ReLU' non-linearity and the final layer a 'Sigmoid'. By design, manifold learning is supposed to match the input and output. On the contrary, we can observe the over-generalization phenomenon in (b,h), where the MLP learns a many-to-one mapping, as evident from the flat regions around the points $\{0.2, 0.8\}$. A scaled version of the loss is shown in the zoomed-in part of (a,g), because increasing the number of training points and spreading them across the space makes the loss manifold flatter. The rows (c-f) and (i-l) show intermediate instances of the sampling runs. The burn-in period is roughly $B=5$, which indicates that the sampling converges fast. The loss manifold is relatively flat between 0.2 and 0.8, which causes the samples to converge more slowly while running the sampling procedure. Compared to (b), in (h) we find larger flat regions around the input values $0.2$ and $0.8$, which in turn makes the curve connecting the flat regions steeper, so our sampling procedure works faster: the outputs (y-axis) are mapped closer to the training input points with each additional iteration of the Kaleidoscopic sampling procedure. (best viewed in color)
  • Figure 4: Multiple points in 1D space with NN bounded by 'Tanh'. We run manifold learning on data points $X=\{-0.8, -0.2, 0.2, 0.8\}$ by fitting an MLP with $H=10,L=10$ and 'Tanh' as the final layer non-linearity. This expands the range of the input and output to $[-1,1]$. The rows (c-f) show intermediate instances of the sampling runs. We can observe the over-generalization phenomenon in (b), where the MLP learns a many-to-one mapping, as evident from the flat regions around the points in $X$. (best viewed in color)
  • Figure 5: Loss manifold and sampling in 2D space: (a) We do manifold learning in 2 dimensions with an MLP $L=2, H=50$ at the point $x=(0.5,0.5)$ and (b) an MLP $L=7, H=50$ at the points $x=[(0.2,0.2), (0.2,0.8), (0.8,0.2), (0.8,0.8)]$. Each plot starts from random noise (in red) sampled from a normal distribution $\mathcal{N}(0.5, 0.5I)$ and runs Kaleidoscopic sampling, whose samples are shown in green. The loss function hyperplane is quite flat, yet a large number of samples are still obtained near the input distribution. This indicates that our sampling procedure is working, which in turn suggests the existence of the 'over-generalization' phenomenon. We found that the number of sampling runs needed to reach the burn-in period is inversely proportional to the complexity of the neural network. Note that the scatter points can look slightly off the manifold due to perspective angle adjustments in the 3D rendering. (best viewed in color)
  • ...and 3 more figures
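
The sampling loop described in the Figure 1 and Figure 3 captions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `stand_in_map` is a hand-written, hypothetical stand-in for a trained model $f$ (a steep sigmoid with flat regions near 0.2 and 0.8, loosely mimicking the many-to-one shape described for Figure 3), not a trained MLP, and the names `kaleidoscopic_sample`, `burn_in` and `num_keep` are chosen here for readability rather than taken from the paper.

```python
import math


def stand_in_map(x: float) -> float:
    """Hypothetical stand-in for a trained f (NOT a trained network).

    A steep sigmoid whose output is nearly flat around 0.2 and 0.8,
    so many different inputs are mapped to almost the same output.
    """
    return 0.2 + 0.6 / (1.0 + math.exp(-20.0 * (x - 0.5)))


def kaleidoscopic_sample(f, noise, burn_in, num_keep=1):
    """Repeatedly apply f to every noise point and keep post-burn-in iterates.

    Returns a list with num_keep entries; entry j holds the points
    obtained after burn_in + j applications of f.
    """
    points = list(noise)
    for _ in range(burn_in):  # discard the first burn_in applications
        points = [f(p) for p in points]
    kept = [points]
    for _ in range(num_keep - 1):  # optionally keep further iterates
        points = [f(p) for p in points]
        kept.append(points)
    return kept


# Fixed starting points are used instead of random noise so the run is repeatable.
noise = [0.0, 0.1, 0.3, 0.45, 0.55, 0.7, 0.9, 1.0]
samples = kaleidoscopic_sample(stand_in_map, noise, burn_in=10)[0]
print([round(s, 3) for s in samples])
```

With this stand-in, every starting point below 0.5 ends up within 0.01 of 0.2 and every point above 0.5 within 0.01 of 0.8 after the burn-in. A trained network would take the place of `stand_in_map`, and its flat regions would depend on the training data and on the depth of the model.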