A Unified Framework for Neural Computation and Learning Over Time

Stefano Melacci; Alessandro Betti; Michele Casoni; Tommaso Guidi; Matteo Tiezzi; Marco Gori

TL;DR

The problem of learning over time is rethought from scratch, leveraging tools from optimal control theory, which yield a unifying view of the temporal dynamics of neural computations and learning.

Abstract

This paper proposes Hamiltonian Learning, a novel unified framework for learning with neural networks "over time", i.e., from a possibly infinite stream of data, in an online manner, without access to future information. Existing works focus on the simplified setting in which the stream has a known finite length or is segmented into smaller sequences, leveraging well-established learning strategies from statistical machine learning. In this paper, the problem of learning over time is rethought from scratch, leveraging tools from optimal control theory, which yield a unifying view of the temporal dynamics of neural computations and learning. Hamiltonian Learning is based on differential equations that: (i) can be integrated without the need for external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; (iii) open up novel perspectives. The proposed framework is showcased by experimentally showing that it can recover gradient-based learning, comparing it to out-of-the-box optimizers, and describing how it is flexible enough to switch from fully-local to partially/non-local computational schemes, possibly distributed over multiple devices, and to BackPropagation without storing activations. Hamiltonian Learning is easy to implement and can help researchers approach the problem of learning over time in a principled and innovative manner.
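As a generic illustration of how a differential equation can be integrated without an external solver while reproducing a gradient-based update, the sketch below applies the explicit Euler method to a plain gradient flow on a toy quadratic loss. It is not the Hamiltonian Learning equations, which are not given in this summary; the loss and the names `dt`, `gain`, and `lr` are assumptions made only for this example.

```python
# Minimal sketch (not the paper's method): explicit Euler integration of the
# gradient flow d(theta)/dt = -gain * dL/d(theta) coincides with gradient
# descent using learning rate lr = dt * gain.
# The toy loss and all names here are hypothetical, chosen for illustration.


def grad(theta, target=3.0):
    """Gradient of the toy loss L(theta) = 0.5 * (theta - target)**2."""
    return theta - target


def gradient_descent(theta, lr, steps):
    """Standard gradient descent; returns the full parameter trajectory."""
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        trajectory.append(theta)
    return trajectory


def euler_gradient_flow(theta, dt, gain, steps):
    """Explicit Euler steps on d(theta)/dt = -gain * grad(theta)."""
    trajectory = [theta]
    for _ in range(steps):
        dtheta_dt = -gain * grad(theta)  # right-hand side of the ODE
        theta = theta + dt * dtheta_dt   # one Euler step of size dt
        trajectory.append(theta)
    return trajectory


gd = gradient_descent(0.0, lr=0.1, steps=50)
flow = euler_gradient_flow(0.0, dt=0.05, gain=2.0, steps=50)

# Largest gap between the two trajectories; expected to be at round-off level.
max_gap = max(abs(a - b) for a, b in zip(gd, flow))
print(max_gap, gd[-1])
```

The two trajectories agree because one Euler step with `dt * gain` equal to `lr` is algebraically the same update as one gradient-descent step. The integrator is a single line of arithmetic per step, so no solver library is involved.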

Paper Structure (13 sections, 43 equations, 4 figures, 1 algorithm)

Introduction
Preliminaries
Hamiltonian Learning
Recovering Gradient-based Learning
Leveraging Hamiltonian Learning
Related Work
Conclusions
Further Details
Optimal Control Theory
Robust Hamiltonian and Forward Hamiltonian Equations
Out-of-the Box Tools vs. Hamiltonian Learning
Feed-forward Networks and State Net
Learning in Recurrent Networks

Figures (4)

Figure 1: Left: state net (solid lines) and output net (dashed lines). Right: state ($\mathbf{h}$ and $\boldsymbol{\theta} = [\boldsymbol{\theta}^{\mathbf{h}}, \boldsymbol{\theta}^{\mathbf{y}}]$). The costate ($\mathbf{z}$ and $\boldsymbol{\omega} = [\boldsymbol{\omega}^{\mathbf{h}}, \boldsymbol{\omega}^{\mathbf{y}}]$) is also shown (not part of the net).
Figure 2: Comparing HL with out-of-the-box PyTorch optimizers. Each plot refers to a different dataset/architecture and reports the loss $L$ as a function of time (datasets and setup in the appendix "Out-of-the Box Tools vs. Hamiltonian Learning"). There are $4$ scenarios: without momentum (GD-a, GD-b) and with momentum (Mom-a, Mom-b). In HL, we implemented each net by solely considering the output function (output), the state function (state), or, in the case of RNNs and LSTMs, jointly considering both (state/output); in the latter case the recurrent part is implemented with the state net and the rest is the output net. Curves of each scenario have the same color/linestyle and a per-approach marker. The (absolute) difference between the weights yielded by the PyTorch optimizer and HL is zero in most cases or within the order of round-off error. Thus, curves of the same scenario overlap.
Figure 3: Experimental comparisons, loss values. We compare gradient-based learning (with and without momentum, denoted Mom and GD, respectively) using popular out-of-the-box tools against Hamiltonian Learning (HL). For the out-of-the-box tools we tested two configurations, denoted with the suffixes "-a" and "-b", which differ in learning rate, momentum term, and damping factor. For HL, $\tau$, $\beta$, $\eta$, $\phi$ are set to values that we theoretically show to be coherent with the parameters of the out-of-the-box tools (see the appendix "Out-of-the Box Tools vs. Hamiltonian Learning"). When considering HL, we can implement the selected model by solely considering the output function (output), the state function (state), or we can split it, putting a portion into the state and a portion into the output (state/output). The plot shows the perfect alignment of the cost function values during the models' training phase ($x$-axis: training steps).
Figure 4: Experimental comparisons, accuracy. Same setting as Figure 3. The plot shows the perfect alignment of the accuracy values during the models' training phase ($x$-axis: training steps).
