ANDHRA Bandersnatch: Training Neural Networks to Predict Parallel Realities

Venkata Satya Sai Ajay Daliparthi

TL;DR

The paper tackles vanishing gradients in deep networks by introducing ANDHRA, a non-merging branching module that splits activations into parallel paths, each ending in its own prediction head. The ANDHRA Bandersnatch (AB) network uses a branching factor $N=2$ across three levels, producing $H_L=2^L$ heads (8 heads at $L=3$). Because the heads share the early layers, their predictions can be combined by majority voting, which gives an ensemble that is inherent to the architecture. All heads are trained jointly with a single global loss. On CIFAR-10/100, the best individual head shows a statistically significant improvement over the baseline. Across ablations, AB variants outperform baselines and various ensemble schemes, and supplementary experiments with Parametric Activations (PReLU) corroborate the robustness of the approach. The method offers a pathway to faster convergence and improved accuracy through parallel, independent branches, with no extra inference cost beyond selecting the best head.
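The head count and the majority-voting step described above can be illustrated with a minimal sketch. The function names `num_heads` and `majority_vote` are hypothetical, and the tie-breaking rule (prefer the earliest head's label among the most-voted labels) is an assumption; the summary does not state how ties are resolved.

```python
from collections import Counter


def num_heads(branching_factor: int, levels: int) -> int:
    """Number of heads when every path splits `branching_factor` ways at each level."""
    return branching_factor ** levels


def majority_vote(head_predictions):
    """Combine per-head class labels into one label per sample.

    head_predictions: list with one entry per head; each entry is a list of
    predicted class labels, one per sample.
    Tie-breaking is an assumption: among the most-voted labels, the label
    predicted by the earliest head wins.
    """
    n_samples = len(head_predictions[0])
    voted = []
    for i in range(n_samples):
        votes = [preds[i] for preds in head_predictions]
        counts = Counter(votes)
        best = max(counts.values())
        voted.append(next(v for v in votes if counts[v] == best))
    return voted


# Branching factor 2 over three levels, as in the summary above.
heads = num_heads(2, 3)

# Toy labels from three heads on two samples (made-up values, not paper results).
combined = majority_vote([[0, 1], [0, 2], [1, 2]])
```

Here `heads` evaluates to 8, matching $2^3$, and `combined` holds one voted label per sample.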

Abstract

Inspired by the Many-Worlds Interpretation (MWI), this work introduces a novel neural network architecture that splits the same input signal into parallel branches at each layer, utilizing a Hyper Rectified Activation, referred to as ANDHRA. The branched layers do not merge and form separate network paths, leading to multiple network heads for output prediction. For a network with a branching factor of 2 at three levels, the total number of heads is $2^3 = 8$. The individual heads are jointly trained by combining their respective loss values. However, the proposed architecture requires additional parameters and memory during training due to the additional branches. During inference, the experimental results on CIFAR-10/100 demonstrate that there exists one individual head that outperforms the baseline accuracy, achieving statistically significant improvement with equal parameters and computational cost.
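As a rough illustration of joint training by combining per-head losses, the sketch below computes a cross-entropy loss for each head and adds them up. Two points are assumptions rather than details from the abstract: the per-head loss is taken to be softmax cross-entropy, and the losses are combined by an unweighted sum. The names `cross_entropy` and `joint_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def joint_loss(per_head_logits, target):
    """Combine head losses into one global loss.

    Assumption: an unweighted sum over heads; the abstract only says the
    loss values are combined.
    """
    return sum(cross_entropy(logits, target) for logits in per_head_logits)


# Eight heads (2^3), each emitting uniform logits over 10 classes (toy input).
toy_logits = [[0.0] * 10 for _ in range(8)]
total = joint_loss(toy_logits, target=3)
```

With uniform logits every head contributes $\ln 10$, so `total` equals $8 \ln 10$; a single backward pass through such a combined loss would update the shared early layers with gradients from all heads.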

Paper Structure

This paper contains 19 sections, 14 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison of training accuracy progression for the baseline and the proposed method AB (ANDHRA Bandersnatch), on a log-scale graph
  • Figure 2: MWI based state changes
  • Figure 3: From the left side, baseline network, the levels & output shapes chart, and the ANDHRA Bandersnatch 2G network
  • Figure 4: ANDHRA module with PReLU
  • Figure 5: From the left side: levels chart, AB2GR3-2H1, AB2GR3-2H2, and AB2GR3-2H3 networks
  • ...and 4 more figures