Expand Neurons, Not Parameters

Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit

TL;DR

This paper tackles polysemanticity and feature interference in neural networks by decoupling neuron count from non-zero parameter count through Fixed Parameter Expansion (FPE). By splitting neurons into edge-disjoint subneurons, FPE widens networks while preserving the parameter budget, reducing feature collisions and improving accuracy on both symbolic (Boolean DNF) tasks and real-world vision tasks using CLIP embeddings. The authors provide theoretical justification, show that even random splits reduce interference, and demonstrate scalability to deeper architectures and compatibility with structured sparsity and dynamic pruning. The findings offer an interpretable mechanism to exploit width to mitigate superposition, with practical implications for memory-limited accelerators where parameter movement dominates cost.

Abstract

This work demonstrates how increasing the number of neurons in a network without increasing its number of non-zero parameters improves performance. We show that this gain corresponds with a decrease in interference between multiple features that would otherwise share the same neurons. To reduce such entanglement at a fixed non-zero parameter count, we introduce Fixed Parameter Expansion (FPE): replace a neuron with multiple children and partition the parent's weights disjointly across them, so that each child inherits a non-overlapping subset of connections. On symbolic tasks, specifically Boolean code problems, clause-aligned FPE systematically reduces polysemanticity metrics and yields higher task accuracy. Notably, random splits of neuron weights approximate these gains, indicating that reduced collisions, not precise assignment, are a primary driver. Consistent with the superposition hypothesis, the benefits of FPE grow with increasing interference: when polysemantic load is high, accuracy improvements are the largest. Transferring these insights to real models (classifiers over CLIP embeddings and deeper multilayer networks), we find that widening networks while maintaining a constant non-zero parameter count consistently increases accuracy. These results identify an interpretability-grounded mechanism to leverage width against superposition, improving performance without increasing the number of non-zero parameters. Such a direction is well matched to modern accelerators, where memory movement of non-zero parameters, rather than raw compute, is the dominant bottleneck.
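
The splitting step described in the abstract can be illustrated with a minimal NumPy sketch of a random split. It is an illustration under stated assumptions, not the authors' implementation: the function name `random_split_fpe`, the expansion factor `alpha`, and the choice to assign each incoming edge to a child uniformly at random are all hypothetical, and the sketch covers only a neuron's incoming weights (how outgoing weights and biases are handled is not specified in this section).

```python
import numpy as np


def random_split_fpe(W1, alpha, rng):
    """Split every hidden neuron into `alpha` children with disjoint input edges.

    W1 has shape (hidden, inputs). The children of parent h occupy rows
    h*alpha ... h*alpha + alpha - 1 of the returned matrix.
    Hypothetical helper: names and the random assignment rule are assumptions.
    """
    hidden, n_in = W1.shape
    # Each incoming edge of each parent is owned by exactly one child.
    owner = rng.integers(0, alpha, size=(hidden, n_in))
    mask = np.zeros((hidden * alpha, n_in), dtype=bool)
    for child in range(alpha):
        # Rows child, child+alpha, ... are the `child`-th children of all parents.
        mask[child::alpha] = owner == child
    # Copy each parent row alpha times, then zero the edges a child does not own.
    W1_expanded = np.repeat(W1, alpha, axis=0) * mask
    return W1_expanded, mask


rng = np.random.default_rng(0)
hidden, n_in, alpha = 4, 12, 3
W1 = rng.normal(size=(hidden, n_in))
W1_expanded, mask = random_split_fpe(W1, alpha, rng)

# The layer is alpha times wider, but the non-zero parameter count is unchanged.
print(W1_expanded.shape, np.count_nonzero(W1), np.count_nonzero(W1_expanded))

# Because the split is a partition, the children's pre-activations sum to the
# parent's pre-activation for any input x.
x = rng.normal(size=n_in)
parent_pre = W1 @ x
children_pre = (W1_expanded @ x).reshape(hidden, alpha).sum(axis=1)
```

Only the pre-activations are additive here; once a nonlinearity is applied to each child separately, the expanded layer is no longer equivalent to the original one.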

Paper Structure

This paper contains 29 sections, 7 figures, 5 tables, 3 algorithms.

Figures (7)

  • Figure 1: Paradigms of parameter efficiency in training and inference. "Train-then-sparsify" minimizes post-training inference cost, at the expense of training a large, dense model initially and of sparse fine-tuning. "Train-then-grow" amortizes some training cost via bootstrapping from a small, dense network, to yield a large, dense network that is more expensive at inference, though some techniques are constrained by final size. Here, we show theoretical justification and empirical feasibility of directly splitting neurons within a small, dense network into a large, sparse one.
  • Figure 2: Dense and clause-split FPE model Gram and $\mathbf{W_1}$ matrices. Clear codes are visible for clauses 2, 4, 6, and 7 for the dense model in the top row. Clear codes are visible for all clauses for the FPE model in the bottom row. Black x's represent masked parameters of value 0.
  • Figure 3: Trends of performance under superposition in neurons and clauses. (a) Relative improvement in test accuracy and the accuracy per parameter for models of varying hidden dimension on 8 clauses. (b) Relative improvement in test accuracy and the accuracy per parameter for models with 8 neurons. y-axis labels are shared for (a) and (b). (c) Heatmap of relative improvement percentage for clause-split FPE models (top) and random-split FPE models (bottom). Relative improvement is calculated as $\frac{\text{FPE test accuracy} - \text{dense test accuracy}}{\text{dense test accuracy}}$. Error bars indicate one standard error of the mean. * indicates $p < 0.05$ and is shown only for $\alpha=4$ for clarity. Results collected over five trials.
  • Figure 4: Changes in feature interference metrics for varying neurons and clauses. (a) Feature capacity (top) and cosine similarity (bottom) fold change for models of varying hidden dimension on 8 clauses compared to baseline. (b) Feature capacity (top) and cosine similarity (bottom) fold change for models of varying number of clauses for 8 neurons compared to baseline. y-axis labels are shared for (a) and (b) and all metrics are normalized to dense values. A fold change of 1.0 represents dense metrics and is shown by a black dashed line. (c) Least-squares regressions on relative improvement versus feature capacity fold change (top) and neuron cosine similarity fold change (bottom) when varying neurons, with the coefficients of determination indicated. Dotted lines correspond to $\alpha=2$ and solid lines correspond to $\alpha=4$. Relative improvement is calculated as before. Error bars indicate one standard error of the mean. * indicates $p < 0.05$ and is shown only for $\alpha=4$ for clarity. Results collected over five trials.
  • Figure 5: Fixed Parameter Expansion helps on real datasets like (a) FashionMNIST, (b) CLIP-embeddings of CIFAR-100, (c) CLIP-embeddings of ImageNet-100, and (d) CLIP-embeddings of ImageNet-1k. For FashionMNIST, the baseline model was pre-trained for 20 epochs before FPE. For CIFAR-100, ImageNet-100, and ImageNet-1k, the baseline model was pre-trained for 25 epochs before FPE. The first row indicates the improvement in test accuracy of the FPE model relative to the dense baseline, for varying hidden dimensions, and is calculated as before. The second row shows the test accuracy per parameter for each configuration. In all figures, the number of neurons refers to the number of neurons pre-expansion. Error bars indicate one standard error of the mean. * indicates $p < 0.05$. Results collected over five trials for (a-c) and ten trials for (d).
  • ...and 2 more figures
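
The relative-improvement measure defined in the Figure 3 caption, together with the standard error of the mean used for the error bars, can be sketched as follows. This is an illustrative sketch only: the accuracy values are made up, the helper names are hypothetical, and computing the ratio per trial before averaging is an assumption, since the captions do not say how trials are aggregated.

```python
import numpy as np


def relative_improvement(fpe_acc, dense_acc):
    """(FPE test accuracy - dense test accuracy) / dense test accuracy."""
    fpe_acc = np.asarray(fpe_acc, dtype=float)
    dense_acc = np.asarray(dense_acc, dtype=float)
    return (fpe_acc - dense_acc) / dense_acc


def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=1) / np.sqrt(values.size)


# Made-up accuracies for five trials; these are not results from the paper.
dense = [0.80, 0.82, 0.78, 0.81, 0.79]
fpe = [0.84, 0.85, 0.80, 0.86, 0.83]

rel = relative_improvement(fpe, dense)  # assumption: one ratio per trial
print(rel.mean(), standard_error(rel))
```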