Absolute abstraction: a renormalisation group approach

Carlo Orientale Caputo, Elias Seiffert, Enrico Frausin, Matteo Marsili

TL;DR

The paper argues that universal, abstract representations—'absolute abstraction'—emerge at a fixed point of a renormalisation-group–like process that combines depth with breadth in data. It derives analytically that the fixed point $p^*(\mathbf{s})$ aligns with the Hierarchical Feature Model (HFM) under a maximal relevance principle, with a parameter $g$ linked to the coding cost and a critical point at $g_c=\log 2$. The authors demonstrate, through Deep Belief Networks and auto-encoders trained on increasingly broad datasets, that internal representations progressively approach the HFM as depth and breadth grow, supporting the RG-based theory. These results suggest a data-independent, universal scaffold for abstraction with implications for AI robustness and cognitive science.

Abstract

Abstraction is the process of extracting the essential features from raw data while ignoring irrelevant details. It is well known that abstraction emerges with depth in neural networks, where deep layers capture abstract characteristics of data by combining lower level features encoded in shallow layers (e.g. edges). Yet we argue that depth alone is not enough to develop truly abstract representations. We advocate that the level of abstraction crucially depends on how broad the training set is. We address the issue within a renormalisation group approach where a representation is expanded to encompass a broader set of data. We take the unique fixed point of this transformation -- the Hierarchical Feature Model -- as a candidate for a representation which is absolutely abstract. This theoretical picture is tested in numerical experiments based on Deep Belief Networks and auto-encoders trained on data of different breadth. These show that representations in neural networks approach the Hierarchical Feature Model as the data get broader and as depth increases, in agreement with theoretical predictions.

Paper Structure (26 sections, 36 equations, 7 figures)

Figures (7)

  • Figure 1: Illustrative example of the RG in statistical physics (A) and of its application in learning (B and C). The RG (A) entails a coarse graining (or decimation) step, whereby small-scale degrees of freedom are integrated out, e.g. with the introduction of block variables (large dots in the middle A-panel), and a rescaling step that restores the original size of the system. In a representation of a given domain of items (animals), in B coarse graining is performed by zooming in on those with a specific feature (living in water), and rescaling corresponds to enriching the representation by adding further details. The same procedure can be reversed (in C): the representation describing a particular domain (animals from planet Earth) is retrained on a wider domain (animals from many planets), neglecting small-scale details (e.g. the difference between whales and dolphins).
  • Figure 2: Graphical representation of the transition matrix $T_{\mathbf{s}_{1:n},\mathbf{s}_{1:n}'}$ of the coarse graining RG for $n=3$. States are represented by circles and transitions by arrows. From each state $\mathbf{s}_{1:n}$ there are only three non-zero matrix elements of $T_{\mathbf{s}_{1:n},\mathbf{s}_{1:n}'}$. Two of them correspond to the possible states $\mathbf{s}_{1:n}'=(s_0,\mathbf{s}_{1:n-1})$ that can be reached by either adding $s_0=1$ to the left of $\mathbf{s}_{1:n}$ (red links) or adding $s_0=0$ (blue dashed links). Both transitions occur with probability $T_{\mathbf{s}_{1:n},\mathbf{s}_{1:n}'}=\frac{1-\alpha}{2}$. The third transition resets to the $\mathbf{0}_{1:n}$ state (green dotted links) and it occurs with probability $T_{\mathbf{s}_{1:n},\mathbf{0}_{1:n}}=\alpha$.
  • Figure 3: Colour plot of the $D_{KL}$ divergence per node between the internal representation of a layer inside a DBN and the HFM (a, b, c and d), and of the fitted parameter $g$ (e) for the DBN in panel d). The internal representation is obtained by sampling $10^5$ configurations $\mathbf{s}$ for each layer from the equilibrium distribution. The DBNs are trained on datasets of increasing breadth: in a), b) and c) M2 refers to a dataset obtained by transforming the digits $2$ of MNIST with translations and rotations, to obtain $N=6\cdot 10^4$ data points, and M refers to the complete MNIST dataset. In a) and b) MF refers to MNIST plus Fashion MNIST, MFE1 coincides with MF plus the letters from $a$ to $i$ of EMNIST, adding letters up to $r$ generates MFE2 and adding all letters yields MFE3. MFE3CC contains all the data of MFE3 plus Cifar-10 images, rescaled to $28\times 28$ pixels. The DBNs in a) and b) are trained for $10^5$ and $10^4$ epochs respectively. DBNs in c) and d) are also trained for $10^5$ epochs. In c) the order in which the datasets are learned is changed: after MNIST the DBN is trained first with letters in EMNIST, then with images of Fashion MNIST, and finally Cifar-10 is added (ME3FCC). In d) the datasets are fragmented in different ways: first MNIST is divided into 5 parts (M1 with digits 0 and 1, M2 up to 3, M3 up to 5, M4 up to 7 and M5 with all digits), then EMNIST and Fashion MNIST are added as before. For this network, the fitted value of the parameter $g$ of the HFM is shown in panel e).
  • Figure 4: a) Colour plot of the transition matrix between digits (see text) generated by the representation in layer $\ell=1$ (top left), $4$ (top right), $7$ (bottom left) and $10$ (bottom right), of a DBN trained on MNIST. b) Distance $D(f_{\ell}^{\mathcal{D}},f_{\ell'}^{\mathcal{D}'})$ between the features $f_{\ell}^{\mathcal{D}}$ generated by states in layer $\ell$ of a DBN trained on dataset $\mathcal{D}$ and those ($f_{\ell'}^{\mathcal{D}'}$) generated by states in layer $\ell'$ of a DBN trained on dataset $\mathcal{D}'$. Features $f_{\ell}^{\mathcal{D}}=p(\mathbf{x}|\mathbf{s}_\ell=\mathbf{e}_i)$ are defined as the activity patterns in the visible layer when only variable $i$ of layer $\ell$ is active. In b) the dataset $\mathcal{D}$ coincides with the MNIST digits $0$ and $1$ (M1), while $\mathcal{D}'$ runs over all broader datasets $M2,\ldots,ME3F$.
  • Figure 5: a) Architectures of the AEs used in this study. The hidden layers (white sticks) process information from the visible layer of the encoder (in grey, left) to the bottleneck layer (in red), from where it is propagated through a mirrored sequence of hidden layers to the visible layer of the decoder (in grey). The architecture with $L+1$ layers is obtained by adding an extra hidden layer between the last one of the AE with $L$ layers and the bottleneck, on both sides of the bottleneck. b) Kullback-Leibler divergence from the marginal HFM $p^*$ of Eq. (\ref{eq:pstar}) of the latent representation of AEs with $n=12$ latent nodes and different depths. Each curve refers to a different dataset: 2M contains the data points corresponding to digits $2$ of MNIST, 2M+ is derived from 2M by adding data points obtained through translations and rotations of the original data points. M is MNIST, EM is EMNIST and FEM includes EMNIST and Fashion MNIST. The colour code of points corresponds to the fitted value of $g$. The inset reports the values of $g$ as a function of the Kullback-Leibler divergence, for different layers rendered by the colour code (same symbols as in the main figure). c) List of the ten most probable configurations of the latent space with their frequency for the AE with $6$ hidden layers trained on the FEM dataset.
  • ...and 2 more figures
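
The transition matrix described in the Figure 2 caption can be written down explicitly for small $n$. The following is a minimal sketch, not the authors' code: the function name `coarse_graining_matrix` and the value `alpha=0.2` are illustrative assumptions, and the only rules used are the ones stated in the caption (prepend $s_0\in\{0,1\}$ and drop the last variable, each with probability $\frac{1-\alpha}{2}$, or reset to $\mathbf{0}_{1:n}$ with probability $\alpha$).

```python
from itertools import product


def coarse_graining_matrix(n, alpha):
    """Build the row-stochastic matrix T[s][s'] from the Figure 2 caption.

    Hypothetical helper for illustration; states are binary tuples s_{1:n}.
    """
    states = list(product((0, 1), repeat=n))
    index = {s: i for i, s in enumerate(states)}
    zero = index[(0,) * n]

    T = [[0.0] * len(states) for _ in states]
    for s, i in index.items():
        # Prepend s_0 on the left and drop the last variable: (s_0, s_{1:n-1}).
        for s0 in (0, 1):
            shifted = (s0,) + s[:-1]
            T[i][index[shifted]] += (1 - alpha) / 2
        # Reset to the all-zero state.
        T[i][zero] += alpha
    return states, T


# alpha = 0.2 is an arbitrary illustrative value, not taken from the paper.
states, T = coarse_graining_matrix(n=3, alpha=0.2)
row_sums = [sum(row) for row in T]
nonzeros = [sum(1 for t in row if t > 0) for row in T]
```

Each row sums to one, so $T$ is a valid stochastic matrix. In this construction, a state whose first $n-1$ variables are all zero has its $s_0=0$ transition landing on $\mathbf{0}_{1:n}$ as well, so the two contributions add up to $\alpha+\frac{1-\alpha}{2}$ and that row has two distinct non-zero entries rather than three.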