Increasing biases can be more efficient than increasing weights

Carlo Metta; Marco Fantozzi; Andrea Papini; Gianluca Amato; Matteo Bergamaschi; Silvia Giulia Galfrè; Alessandro Marchetti; Michelangelo Vegliò; Maurizio Parton; Francesco Morandin

TL;DR

The paper argues that adding weights is not the most efficient way to improve a neural network. It introduces Dendrite-Activated Connections (DAC), which combine pre-activation with unshared biases to preserve information as it flows between layers. DAC replaces the standard post-activation with per-connection biases, $y_{i,j} = \varphi(b_{i,j} + z_j)$ and $z_i = \sum_j w_{i,j} y_{i,j}$, which increases per-parameter expressivity. Empirically, DAC yields consistent performance gains on SGEMM regression, CIFAR-10/100, Imagenette/Imagewoof, and ISIC, with modest increases in parameters and FLOPs. Ablation studies show that pre-activation with unshared biases often outperforms alternatives that modify only the activations or rely on replicated inputs. Theoretically, DAC increases representational power (for example, $\text{PL}_k$ can be represented with $2k$ DAC parameters versus $3k+1$ for standard networks) and enables granular gradient masking, which supports more efficient information propagation. Overall, the work shows that increasing biases can be a more efficient route to better performance than increasing weights, with implications for architecture design and information flow in neural networks.
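The two equations in the TL;DR can be made concrete with a small sketch. The pure-Python code below is an illustration only, not the authors' implementation: the function names, the choice of ReLU for $\varphi$, and the example numbers are all assumptions. It compares a standard connection, where every output unit sees the same activated input $\varphi(b_j + z_j)$, with a DAC connection, where each pair $(i, j)$ has its own bias $b_{i,j}$.

```python
def relu(x):
    # Assumed activation; the summary only names a generic phi.
    return max(x, 0.0)

def standard_forward(z_in, W, b, phi=relu):
    """Shared biases: y_j = phi(b_j + z_j), z_i = sum_j W[i][j] * y_j."""
    y = [phi(b_j + z_j) for b_j, z_j in zip(b, z_in)]
    return [sum(w_ij * y_j for w_ij, y_j in zip(w_row, y)) for w_row in W]

def dac_forward(z_in, W, B, phi=relu):
    """Unshared biases: y_ij = phi(B[i][j] + z_j), z_i = sum_j W[i][j] * y_ij."""
    return [
        sum(w_ij * phi(b_ij + z_j) for w_ij, b_ij, z_j in zip(w_row, b_row, z_in))
        for w_row, b_row in zip(W, B)
    ]

# Hypothetical 3-input, 2-output layer (values chosen arbitrarily).
z_in = [0.5, -1.0, 2.0]
W = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.0]]
b = [0.1, 0.2, -0.3]

# Tying the DAC biases across output units recovers the standard layer.
B_tied = [list(b), list(b)]
out_standard = standard_forward(z_in, W, b)
out_dac_tied = dac_forward(z_in, W, B_tied)

# Untying one bias changes only the output unit that owns it.
B_free = [list(b), [0.1, 1.5, -0.3]]
out_dac_free = dac_forward(z_in, W, B_free)
```

With tied biases the DAC layer reduces to the standard one, so under these assumptions the standard connection is a special case of DAC. Untying a bias gives each output unit its own threshold on the same incoming value.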

Abstract

We introduce a novel computational unit for neural networks that features multiple biases, challenging the traditional perceptron structure. This unit emphasizes the importance of preserving uncorrupted information as it is passed from one unit to the next, applying activation functions later in the process with specialized biases for each unit. Through both empirical and theoretical analyses, we show that by focusing on increasing biases rather than weights, there is potential for significant enhancement in a neural network model's performance. This approach offers an alternative perspective on optimizing information flow within neural networks. See source code at https://github.com/CuriosAI/dac-dev.

Paper Structure (14 sections, 1 theorem, 18 equations, 11 figures, 3 tables)

Introduction
Model
Related work
Methods
Experiments
Theoretical discussion
Conclusions
Acknowledgements
Biological inspiration
Error rate estimation
Representation power
Separation of sets in low dimension
Optimal use of parameters for piecewise-linear functions
Additional plots

Key Result

Theorem C.1

Let $k$ be a positive integer, and let $\text{PL}_k$ be the set of continuous piecewise linear functions $\mathbb{R}\to\mathbb{R}$ consisting of exactly $k$ linear components. Then:

Figures (11)

Figure 1: Standard connection between two consecutive layers. The output layer (pink) is fully connected and has two units labelled 4 and 5. The input layer has three units: $\mathcal{I}_4=\mathcal{I}_5=\{1,2,3\}$. Bullets and rectangles represent linear aggregation and nonlinear filters from the standard post-activation equation, respectively. Units 4 and 5 must share the same biases $b_1,b_2,b_3$ in the activation of their inputs.
Figure 2: Same structure as in Figure 1, with post-activation replaced by pre-activation with unshared biases. Rectangles and bullets represent nonlinear filters and linear aggregations from the DAC equation, respectively. The biases in the activation between the input and the output layer depend both on the input node (1, 2 or 3) and the output node (4 or 5), and so, from the point of view of the output units, we refer to them as unshared.
Figure 3: Aggregated and averaged results for the SGEMM regression task. Experiments are grouped by network shape (pyramidal or rectangular, see text) and width. Error bars represent the sample standard deviation of the values contributing to the average. Fully connected networks with DACs perform better than the baseline at larger widths. At smaller widths, where the overall performance of the network is far from optimal, they perform similarly or worse.
Figure 4: Efficiency analysis of unshared biases for the SGEMM regression task. Rectangular baseline networks were compared with models that have double the parameters, obtained either by adding weights (orange) or by making biases unshared (blue). The resulting changes in MSE are shown (negative means improvement). Error bars represent the sample standard deviation of the values contributing to the average (see text).
Figure 5: VGG, average test error. Compared performances of VGG 20 layers, 16 channels, and VGG 14 layers, 32 channels with shared biases (baseline, orange) and unshared biases (DAC, blue) on CIFAR-10 and CIFAR-100. Test error (vertical axis) is averaged over 5 replicates and over 5 epochs (see text). Error bars are 95% confidence intervals for the true mean value. Complexity (horizontal axis) is measured in GFLOPs per forward pass.
...and 6 more figures
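
The difference between Figures 1 and 2 can also be read as a parameter count. The sketch below is a back-of-the-envelope illustration under an assumed counting convention: one weight per connection, plus either one bias per input unit (shared, as in Figure 1) or one bias per connection (unshared, as in Figure 2). The function name `count_params` and the width 128 are hypothetical and do not come from the paper.

```python
def count_params(n_in, n_out, unshared_biases):
    """Parameters of one fully connected layer under the assumed convention."""
    weights = n_in * n_out
    # Shared: one bias per input unit. Unshared: one bias per connection.
    biases = n_in * n_out if unshared_biases else n_in
    return weights + biases

# The 3-input, 2-output example of Figures 1 and 2.
fig1_standard = count_params(3, 2, unshared_biases=False)  # 6 weights + 3 biases
fig2_dac = count_params(3, 2, unshared_biases=True)        # 6 weights + 6 biases

# A square layer of hypothetical width 128: unsharing the biases
# roughly doubles the parameter count.
width = 128
ratio = count_params(width, width, True) / count_params(width, width, False)
```

For a square layer of width $n$ the ratio is $2n^2 / (n^2 + n)$, which approaches 2 as $n$ grows. This is consistent with the "double the parameters" comparison described in the caption of Figure 4.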

Theorems & Definitions (5)

Remark 2.1
Remark 5.1
Theorem C.1
proof
Remark C.2
