Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning

Xi Wang

TL;DR

The paper addresses limitations of traditional gradient descent by decomposing each neuron's weight into two components, $W_1$ and $W_2$, with effective weight $W = W_1 - W_2$, so that target and non-target feature information is updated separately. The method applies moving-average updates to each component and uses a mean-based formulation inspired by excitatory–inhibitory dynamics, aiming to improve robustness to noise and class imbalance while keeping the inference cost of the standard $WX + b$ setup. Across regression and classification tasks, the reported results show improved generalization in regression, competitive accuracy in classification, and greater resilience to data sparsity and perturbations, though with some instability on certain datasets (e.g., CIFAR-10). The work positions itself relative to contrastive learning and dynamic regularization, offering a neuron-centric alternative to contrasting samples within large batches, and provides code for replication. Overall, the dual-weight updates are presented as a route to more stable, contrast-aware learning without added inference-time overhead.

Abstract

We introduce a novel framework for learning in neural networks by decomposing each neuron's weight vector into two distinct parts, $W_1$ and $W_2$, thereby modeling contrastive information directly at the neuron level. Traditional gradient descent stores both positive (target) and negative (non-target) feature information in a single weight vector, often obscuring fine-grained distinctions. Our approach, by contrast, maintains separate updates for target and non-target features, ultimately forming a single effective weight $W = W_1 - W_2$ that is more robust to noise and class imbalance. Experimental results on both regression (California Housing, Wine Quality) and classification (MNIST, Fashion-MNIST, CIFAR-10) tasks suggest that this decomposition enhances generalization and resists overfitting, especially when training data are sparse or noisy. Crucially, the inference complexity remains the same as in the standard $WX + \text{bias}$ setup, offering a practical solution for improved learning without additional inference-time overhead.
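The inference-cost claim follows from linearity: applying $W_1$ and $W_2$ to the same input and subtracting gives the same result as applying the single matrix $W = W_1 - W_2$. The following is a minimal sketch of that identity in plain Python, assuming a dense layer with a bias term. The function names (`fold`, `forward_dual`, `forward_folded`) and the random values are illustrative assumptions, not taken from the paper, and the training updates for $W_1$ and $W_2$ are deliberately left out.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fold(W1, W2):
    """Collapse the two components into the effective weight W = W1 - W2."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(W1, W2)]


def forward_dual(W1, W2, b, x):
    """Two-branch view: response under W1 minus response under W2, plus bias."""
    excitatory = matvec(W1, x)
    inhibitory = matvec(W2, x)
    return [e - i + bi for e, i, bi in zip(excitatory, inhibitory, b)]


def forward_folded(W, b, x):
    """Standard single-weight layer: W x + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]


# Illustrative sizes and values (assumptions, not from the paper).
rng = random.Random(0)
n_out, n_in = 2, 3
W1 = [[rng.uniform(0, 1) for _ in range(n_in)] for _ in range(n_out)]
W2 = [[rng.uniform(0, 1) for _ in range(n_in)] for _ in range(n_out)]
b = [rng.uniform(-1, 1) for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

y_dual = forward_dual(W1, W2, b, x)
y_folded = forward_folded(fold(W1, W2), b, x)

# The two views agree up to floating-point rounding.
max_gap = max(abs(p - q) for p, q in zip(y_dual, y_folded))
print("largest difference between the two views:", max_gap)
```

Under these assumptions, `fold` needs to run only once after training, so a deployed layer stores and multiplies a single matrix, exactly as in the standard setup.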

Paper Structure

This paper contains 26 sections, 35 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Both the excitatory neuron (top) and the inhibitory neuron (bottom) receive the same inputs $\{x_1, x_2, x_3\}$. The excitatory neuron uses $W_1$ as its weight vector, while the inhibitory neuron uses $W_2$. Their combined outputs then feed into a final output neuron (right).
  • Figure 2: Relative performance improvement over the gradient method: (Top) regression task, (Bottom) classification task.
  • Figure 3: Visualization of receptive fields of neurons across models.
  • Figure 4: Results on the WhiteWine dataset for models with 2 layers (top row) and 3 layers (bottom row). From left to right: sample sizes 100, 500, 1000.
  • Figure 5: Results on the House dataset for models with 2 layers (top row) and 3 layers (bottom row). From left to right: sample sizes 100, 500, 1000.
  • ...and 3 more figures