A Truly Sparse and General Implementation of Gradient-Based Synaptic Plasticity

Jamie Lohoff, Anil Kaya, Florian Assmuth, Emre Neftci

TL;DR

This work presents a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules that generalizes to arbitrary neuron models. It demonstrates that memory utilization scales with network size and does not depend on the sequence length, as expected from forward AD methods.

Abstract

Online synaptic plasticity rules derived from gradient descent achieve high accuracy on a wide range of practical tasks. However, their software implementation often requires either tediously hand-derived gradients or gradient backpropagation, which sacrifices the online capability of the rules. In this work, we present a custom automatic differentiation (AD) pipeline for sparse and online implementation of gradient-based synaptic plasticity rules that generalizes to arbitrary neuron models. Our work brings the programming ease of backpropagation-type methods to forward AD while remaining memory-efficient. To achieve this, we exploit the advantageous compute and memory scaling of online synaptic plasticity by providing an inherently sparse implementation of AD in which expensive tensor contractions are replaced with simple element-wise multiplications whenever the tensors are diagonal. Gradient-based synaptic plasticity rules such as eligibility propagation (e-prop) have exactly this property and thus profit immensely from this feature. We demonstrate the alignment of our gradients with those of gradient backpropagation on a synthetic task where e-prop gradients are exact, as well as on audio speech classification benchmarks. We also demonstrate how memory utilization scales with network size without dependence on the sequence length, as expected from forward AD methods.
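
The replacement of tensor contractions by element-wise multiplications can be illustrated with a minimal JAX sketch. It does not use Graphax or reproduce its API. The function names `dense_contraction` and `sparse_contraction`, the shapes, and the random values are all assumptions chosen for illustration. The sketch only shows that contracting a diagonal Jacobian with another tensor gives the same result as scaling rows by the stored diagonal, which needs far fewer operations and less memory.

```python
import jax
import jax.numpy as jnp


def dense_contraction(jacobian, grads):
    # Full contraction of an (n, n) Jacobian with an (n, m) tensor: O(n^2 * m) work.
    return jacobian @ grads


def sparse_contraction(jacobian_diag, grads):
    # Same result when the Jacobian is diagonal: only the n diagonal entries
    # are stored, and row i of `grads` is scaled by entry i. O(n * m) work.
    return jacobian_diag[:, None] * grads


# Hypothetical sizes and values, used only to compare the two paths.
n, m = 4, 3
key_j, key_g = jax.random.split(jax.random.PRNGKey(0))
jacobian_diag = jax.random.normal(key_j, (n,))
grads = jax.random.normal(key_g, (n, m))

dense = dense_contraction(jnp.diag(jacobian_diag), grads)
sparse = sparse_contraction(jacobian_diag, grads)
```

Both paths return an array of shape `(n, m)` with matching entries; the sparse path never materializes the `n * n` matrix, which is where the savings in the diagonal case come from.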

Paper Structure

This paper contains 13 sections, 11 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Exploitation of sparsity for the calculation of $G_{t}$ with Graphax. The sparse matrices are stored as lower-dimensional dense representations to reduce memory use and enable efficient element-wise operations.
  • Figure 2: LIF computational graph from the LIF equation, with known partial derivatives assigned to each edge as described by Definition 1 (vertex elimination). Emphasis is placed on the Kronecker deltas, which represent diagonal matrices; only the edges from $\alpha$ and $\boldsymbol{z}_t$ are non-diagonal. There are two outputs (in the aquamarine labels); however, since $\boldsymbol{z}_{t+1}$ depends on $\boldsymbol{u}_{t+1}$, the computational graph has only one output.
  • Figure 3: LIF computational graph as seen in Figure 2, with the first two steps of reverse-mode vertex elimination. The indices used for each Kronecker delta $\delta_{ij}$ are independent of one another. This independence can best be seen between the two steps.
  • Figure 4: Evaluation time of a single step for 128 hidden neurons as the number of time steps per training example varies, comparing our e-prop implementation with Graphax against BPTT, naïve e-prop, and RTRL.
  • Figure 5: Evaluation time of a single step as the number of hidden neurons varies, with 1000 time steps per training example, comparing our e-prop implementation with Graphax against BPTT, naïve e-prop, and RTRL.
  • ...and 2 more figures
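
The diagonal structure highlighted in the Figure 2 caption can be checked numerically with a small JAX sketch. The LIF equation itself is not reproduced on this page, so the update `lif_membrane_update` below is an assumed, commonly used form (leaky integration of the membrane potential plus recurrent input), not necessarily the paper's exact model. The names, sizes, and values are hypothetical. The sketch uses `jax.jacfwd` rather than Graphax.

```python
import jax
import jax.numpy as jnp


def lif_membrane_update(u, z, W, alpha):
    # Assumed LIF membrane update: leak the previous potential `u` by `alpha`
    # and add recurrent input from the previous spikes `z` through weights `W`.
    return alpha * u + W @ z


# Hypothetical network size and state, for illustration only.
n = 4
key_u, key_w = jax.random.split(jax.random.PRNGKey(0))
u = jax.random.normal(key_u, (n,))
z = jnp.array([1.0, 0.0, 1.0, 0.0])
W = jax.random.normal(key_w, (n, n))
alpha = 0.9

# Jacobian of the new potential w.r.t. the old potential: alpha times the
# identity, i.e. a Kronecker delta scaled by alpha.
J_u = jax.jacfwd(lif_membrane_update, argnums=0)(u, z, W, alpha)

# Jacobian w.r.t. the incoming spikes: the dense weight matrix W.
J_z = jax.jacfwd(lif_membrane_update, argnums=1)(u, z, W, alpha)

# True when every off-diagonal entry of J_u is zero.
u_edge_is_diagonal = bool(jnp.allclose(J_u, jnp.diag(jnp.diag(J_u))))
```

Under this assumed update, the edge from the previous membrane potential is diagonal and could be stored as a length-`n` vector, while the edge from $\boldsymbol{z}_t$ is the full weight matrix, in line with the caption's remark that the edges from $\boldsymbol{z}_t$ are non-diagonal.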

Theorems & Definitions (1)

  • Definition 1