ANDHRA Bandersnatch: Training Neural Networks to Predict Parallel Realities
Venkata Satya Sai Ajay Daliparthi
TL;DR
The paper tackles vanishing gradients in deep networks by introducing ANDHRA, a non-merging branching module that splits activations into parallel paths, each ending in its own prediction head. The ANDHRA Bandersnatch (AB) network uses a branching factor $N=2$ across three levels, giving $H_L=2^L=8$ heads. Because the heads share early layers, the network supports ensemble-like prediction via majority voting while remaining efficient. All heads are trained jointly under a single global loss. On CIFAR-10/100, the top-performing head shows statistically significant improvements over the baselines, and the ensemble comes built into the architecture. Across ablations, AB variants outperform the baselines and various ensemble schemes, and supplementary experiments with parametric activations (PReLU) corroborate the robustness of the approach. The method offers a path to faster convergence and improved accuracy through parallel, non-merging branches, with no extra inference cost beyond selecting the best head.
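As a rough illustration of the head count and the majority vote mentioned above, here is a minimal sketch in plain Python. The function name `majority_vote`, the example labels, and the tie-breaking rule (first label in head order wins) are assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter

def majority_vote(head_predictions):
    """Return the most common class label among the heads' predictions.

    Ties go to the label that appears first in the list. This tie-break
    is an assumption of the sketch, not a rule stated in the paper.
    """
    counts = Counter(head_predictions)
    best = max(counts.values())
    for label in head_predictions:
        if counts[label] == best:
            return label

branching_factor, levels = 2, 3
num_heads = branching_factor ** levels  # N^L heads; 2^3 = 8 here

# Hypothetical class labels predicted by the 8 heads for one image.
preds = [3, 3, 5, 3, 7, 5, 3, 3]
assert len(preds) == num_heads
print(majority_vote(preds))
```

With these made-up labels, five of the eight heads agree, so the vote returns their shared label.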
Abstract
Inspired by the Many-Worlds Interpretation (MWI), this work introduces a novel neural network architecture that splits the same input signal into parallel branches at each layer, using a Hyper Rectified Activation referred to as ANDHRA. The branched layers do not merge; they form separate network paths, leading to multiple network heads for output prediction. For a network with a branching factor of 2 at three levels, the total number of heads is $2^3 = 8$. The individual heads are jointly trained by combining their respective loss values. The proposed architecture requires additional parameters and memory during training because of the additional branches. At inference, however, experimental results on CIFAR-10/100 show that one individual head outperforms the baseline accuracy, a statistically significant improvement at equal parameter count and computational cost.
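The branching and joint training described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`head_paths`, `cross_entropy`, `joint_loss`) are hypothetical, and the unweighted sum is one simple way to combine per-head losses; the abstract does not specify the exact combination rule.

```python
import math
from itertools import product

def head_paths(branching_factor, levels):
    """Enumerate every root-to-head path as a tuple of branch indices."""
    return list(product(range(branching_factor), repeat=levels))

def cross_entropy(logits, target):
    """Cross-entropy of one sample, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_loss(per_head_logits, target):
    """Combine head losses by an unweighted sum (assumed rule)."""
    return sum(cross_entropy(logits, target) for logits in per_head_logits)

paths = head_paths(2, 3)
print(len(paths))  # one path per head: 2^3 = 8

# Hypothetical logits: every head is undecided between two classes.
logits = [[0.0, 0.0] for _ in paths]
print(joint_loss(logits, target=0))
```

Because the branches never merge, each path tuple identifies exactly one head, which is why the number of paths equals the number of heads.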
