The unbearable lightness of Restricted Boltzmann Machines: Theoretical Insights and Biological Applications

Giovanni di Sarra; Barbara Bravi; Yasser Roudi

TL;DR

This review examines how activation functions, which describe the input-output relationship of single neurons in RBMs, shape the functionality of these models, and covers recent theoretical results on the benefits and limitations of different activation functions.

Abstract

Restricted Boltzmann Machines (RBMs) are simple yet powerful neural networks. They can be used for learning structure in data, and serve as building blocks of more complex neural architectures. At the same time, their simplicity makes them easy to use and amenable to theoretical analysis, and yields interpretable models in applications. Here, we review the role that activation functions, which describe the input-output relationship of single neurons in RBMs, play in the functionality of these models. We discuss recent theoretical results on the benefits and limitations of different activation functions. We also review applications to biological data analysis, ranging from neural data analysis, where RBM units are mostly taken to be binary with sigmoid activation functions, to protein data analysis and immunology, where non-binary units and non-sigmoid activation functions have recently been shown to yield important insights into the data. Finally, we discuss open problems whose resolution could shed light on broader issues in neural network research.

Paper Structure (5 equations, 3 figures)


Figures (3)

Figure 1: RBM structure: two sets of variables $\{v_i\}_{i=1}^N$ and $\{z_{\mu}\}_{\mu =1}^M$ are organized into two layers, connected by weights $W_{i\mu}(v_i)$ living on a bipartite graph. Each set is subject to potentials $B_{i}$ and $U_{\mu}$ acting on single visible and hidden units, respectively.
Figure 2: RBM applications to protein data in immunology. A: T cells recognize cancer and infected cells via the binding of T-cell receptors to antigens presented on the cell's surface by HLA proteins. B: High-throughput experimental and sequencing platforms yield large datasets sampling the proteins involved in immune recognition (T-cell receptors, antigens). C: The RBM log-likelihood gives a probabilistic score that can discriminate functional from non-functional proteins: in this example, antigens presented by a specific HLA (including cancer-related antigens, in red) from generic non-presentable protein fragments. D: The RBM latent representations group protein sequences into functional subfamilies: in this example, antigens presented by different HLA types. Panels C and D were adapted from bravi2021. E: 2-step learning of diffRBM bravi2023, in this example applied to modeling antigen immunogenicity.
Figure 3: Effect of hidden-unit non-linearity on learning in simple RBMs. A: RBMs with $N=2$, $M=1$ and different hidden potentials $U(z)$ are trained by exactly maximizing the log-likelihood for data from $p_{LG}({\bf v})$; $\langle \cdot \rangle_{LG}$ denotes averages over $p_{LG}({\bf v})$. B-E: The log-likelihood, computed exactly by enumerating the averages over $p_{LG}({\bf v})$, is shown in the $(w,b)$ plane for four potentials $U(z)$ corresponding to (B) step, (C) ReLU, (D) exponential, and (E) linear hidden activation functions. Inverting Eq. \ref{eq:inter}, the $(w,b)$ values corresponding to $h=0, J=0.4$ are denoted by green crosses. These also correspond to the maximum-likelihood values. The arrows represent the eigenvectors of the Hessian of the likelihood computed at its maximum, with eigenvalues reported in the legend. Although $\lambda_1$ is always close to zero, differences in its magnitude between activation functions lead to substantially different speeds of convergence. Learning trajectories obtained with the same number of gradient descent steps, learning rate and initial condition (red circle) are shown, ending at the cyan circle.
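
The exact computation described in the Figure 3 caption can be illustrated with a minimal Python sketch for the linear (Gaussian) hidden unit only. Several ingredients are assumptions not taken from the source: visible units are encoded as $\pm 1$ spins, the target $p_{LG}$ is replaced by a hypothetical pairwise distribution $p({\bf v}) \propto \exp(h(v_1+v_2) + J v_1 v_2)$, both visible units share one weight $w$ and one bias $b$, and the hidden potential is $U(z) = z^2/2$. Under these assumptions the hidden unit can be integrated out analytically, giving $J = w^2$ and $h = b$; this mapping stands in for the paper's Eq. \ref{eq:inter}, whose actual form is not shown here. The function names are illustrative, and the sketch is not expected to reproduce the eigenvalues in the figure.

```python
import itertools
import math

# Assumption: visible units are +/-1 spins (the source does not state the encoding).
STATES = list(itertools.product([-1, 1], repeat=2))

def target_probs(h, J):
    """Hypothetical pairwise target: p(v) proportional to exp(h*(v1+v2) + J*v1*v2)."""
    weights = [math.exp(h * (v1 + v2) + J * v1 * v2) for v1, v2 in STATES]
    Z = sum(weights)
    return [x / Z for x in weights]

def rbm_log_probs(w, b):
    """Log-marginal over v of an N=2, M=1 RBM with a Gaussian hidden unit.

    Assumed energy: E(v, z) = -b*(v1+v2) + z**2/2 - w*z*(v1+v2).
    Integrating out z gives log p(v) = b*s + (w*s)**2/2 - log Z, with s = v1+v2.
    """
    log_unnorm = [b * (v1 + v2) + 0.5 * (w * (v1 + v2)) ** 2 for v1, v2 in STATES]
    log_Z = math.log(sum(math.exp(x) for x in log_unnorm))
    return [x - log_Z for x in log_unnorm]

def log_likelihood(w, b, p_data):
    """Exact average log-likelihood, enumerated over the four visible states."""
    return sum(p * lp for p, lp in zip(p_data, rbm_log_probs(w, b)))

def hessian(f, x, y, eps=1e-4):
    """Central finite-difference Hessian entries (fxx, fxy, fyy) of f at (x, y)."""
    fxx = (f(x + eps, y) - 2 * f(x, y) + f(x - eps, y)) / eps**2
    fyy = (f(x, y + eps) - 2 * f(x, y) + f(x, y - eps)) / eps**2
    fxy = (f(x + eps, y + eps) - f(x + eps, y - eps)
           - f(x - eps, y + eps) + f(x - eps, y - eps)) / (4 * eps**2)
    return fxx, fxy, fyy

def sym_eigvals(fxx, fxy, fyy):
    """Eigenvalues (ascending) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = 0.5 * (fxx + fyy)
    rad = math.hypot(0.5 * (fxx - fyy), fxy)
    return mean - rad, mean + rad

p_data = target_probs(h=0.0, J=0.4)
# Under the assumptions above, J = w**2 and h = b, so the maximum sits here.
w_star, b_star = math.sqrt(0.4), 0.0
ll = lambda w, b: log_likelihood(w, b, p_data)
eigs = sym_eigvals(*hessian(ll, w_star, b_star))
print("log-likelihood at maximum:", ll(w_star, b_star))
print("Hessian eigenvalues:", eigs)
```

Evaluating `ll` on a grid of $(w,b)$ values would give the analogue of the landscape in panel E under these assumptions. The other activation functions require a different marginalization over $z$ and are not covered by this sketch.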