How Deep Networks Learn Sparse and Hierarchical Data: the Sparse Random Hierarchy Model

Umberto Tomasini; Matthieu Wyart

TL;DR

This work addresses why high-dimensional data are learnable by linking deep networks' hierarchical representations to insensitivity to discrete transformations. It introduces the Sparse Random Hierarchy Model (SRHM), showing that sparsity within hierarchical generative structures induces invariances to discrete diffeomorphisms and that a hierarchical representation emerges precisely when such invariances are learned, with the training size needed quantified by architecture-specific polynomial scalings. The authors derive sample-complexity predictions for locally connected nets and CNNs, demonstrate that invariances to synonyms and diffeomorphisms emerge at the same $P^*$ as task learning, and offer a heuristic gradient-descent argument explaining how sparsity drives joint learning and stability. This framework unifies hierarchical representations with task invariances, providing insight into why deep networks beat the curse of dimensionality and suggesting avenues for analyzing unsupervised and structured representations in neural networks.

Abstract

Understanding what makes high-dimensional data learnable is a fundamental question in machine learning. On the one hand, it is believed that the success of deep learning lies in its ability to build a hierarchy of representations that become increasingly abstract with depth, going from simple features like edges to more complex concepts. On the other hand, learning to be insensitive to invariances of the task, such as smooth transformations for image datasets, has been argued to be important for deep networks, and it strongly correlates with their performance. In this work, we aim to explain this correlation and unify these two viewpoints. We show that by introducing sparsity to generative hierarchical models of data, the task acquires insensitivity to spatial transformations that are discrete versions of smooth transformations. In particular, we introduce the Sparse Random Hierarchy Model (SRHM), where we observe and rationalize that a hierarchical representation mirroring the hierarchical model is learnt precisely when such insensitivity is learnt, thereby explaining the strong correlation between the latter and performance. Moreover, we quantify how the sample complexity of CNNs learning the SRHM depends on both the sparsity and hierarchical structure of the task.

Paper Structure

This paper contains 17 sections, 11 equations, 17 figures, and 1 table.

Introduction
Our contributions
Prior work
Background: hierarchical generative models
Sparsity and stability to diffeomorphisms
Sample complexity
Learning invariant representation
Sample complexities arguments
Limitations
Conclusions
Sparsity B
Common architectures learning the SRHM
Learning the SRHM with Gradient Descent
Sample complexities and learnt representations for LCNs
Sample complexities and learnt representations for CNNs
...and 2 more sections

Figures (17)

Figure 1: CIFAR10: (A) Test error vs sensitivity to diffeomorphisms of common architectures trained on all of CIFAR10, showing a remarkable correlation between the two quantities. A grey line, corresponding to a power law, guides the eye. (B) Same as (A), for increasing size of the training set $P$, whose value is indicated by the degree of opacity. The sensitivity to smooth transformations is computed relative to the sensitivity to white noise. (A, B) are adapted from petrini_relative_2021. The SRHM captures these observations: (C) Test error vs sensitivity to diffeomorphisms of a CNN trained with $P=7400$ on the SRHM, with parameters $L=s=s_0=2$ and $n_c=m=10$. The sensitivity to diffeomorphisms is defined as the change of network output induced by a diffeomorphism applied to the input, see Eq. \ref{eq:d_2}. For details about the architectures and their training process, see \ref{app:sens_testerror_newnets}. (D) Same as (C) for sensitivity to exchange of synonyms, defined as the change of the network output induced by an exchange of synonyms (defined in Section 2) applied to the input, see Eq. \ref{eq:s_2}. (E) and (F): as top panels (C) and (D), for increasing $P$ (increasing opacity).
Figure 2: (1) Top: an instance of the production rules of a Random Hierarchical Model (RHM) with $n_c=2$ classes, $L=2$, $m=3$ synonyms per feature, vocabulary size $v=3$, and $s=2$. Here the set of classes is $\mathcal{C}=\{\text{green, orange}\}$, the high-level feature vocabulary is $\mathcal{V}_2=\{\text{red, blue, purple}\}$ and the low-level feature vocabulary is $\mathcal{V}_1=\{\text{turquoise, pink, green}\}$. Bottom: a couple of examples generated via the rules above. The first example is generated by the production rules in the black boxes (i.e. the green label generates (red, blue), which in turn generate the couples (turquoise, pink) and (pink, green)). (2) Top: effect of a diffeomorphism $\tau$ on a dog. The blue arrows represent the displacement field induced by $\tau$. Bottom: effect of a diffeomorphism $\tau$ on an instance of a sparse generative hierarchical task. (3) Different definitions of sparsity. (A) Each of the $s$ informative features is embedded in a sub-patch of size $s_0+1$ with exactly $s_0$ uninformative elements, yielding a patch of $s(s_0+1)$ elements. (B) The $s$ informative features can occupy any position within the patch of $s(s_0+1)$ elements. In both cases, all the possible rearrangements are shown for $s=2$ and $s_0=1$. At the next production rule, each uninformative element generates an empty patch of $s(s_0+1)$ uninformative elements, as pictured in (4). (5) Four data points sampled from the generative hierarchical task shown in panel (1) with sparsity (A). The first two examples follow the rules in the black boxes in panels (1) and (4), showcasing different feature rearrangements.
Figure 3: Networks: (a) Locally Connected Network (LCN). Each neuron's weight focuses on a single input element (in red). Networks have $L$ hidden layers, with filters matching patches of size $s(s_0+1)$ from the generative process in \ref{fig:rhm_all} (here $L=2$, $s=2$, $s_0=1$). A last fully connected layer connects the output of the last local layer with the output. (b) Convolutional Neural Network (CNN) with the structure of (a), featuring weight sharing such that each weight considers different pixels in all patches of size $s(s_0+1)$ (in red).
Figure 4: (A) Test error $\varepsilon(P)$ versus number of training points $P$. To extract the sample complexity $P^*$, we fix an arbitrary threshold $\varepsilon^*=\varepsilon(P^*)$. Here $\varepsilon^*=10\%$. (B) Empirical sample complexity $P^*$ for an LCN to reach a $10\%$ test error $\varepsilon$ versus the estimate of Eq. \ref{eq:pstar_LCN} for $s=2$, different depths $L$ (red for $L=2$, blue for $L=3$), different vocabulary sizes $v$ (different darkness), number of classes $n_c=v$, maximal $m=v^{s-1}$ and different $s_0$ (different markers). (C) Same as (B) for CNNs, supporting Eq. \ref{eq:pstar_CNN}. Further support for Eq. \ref{eq:pstar_LCN} and Eq. \ref{eq:pstar_CNN} is obtained by varying $s$, as shown in \ref{app:sens_testerror_lcn}, \ref{fig:tasksempos_lcn_s3} and \ref{app:sens_testerror_cnn}, \ref{fig:task_cnn_s3}.
Figure 5: Sample complexity of an LCN learning the SRHM for varying input dimension $d$ and fraction $F$ of relevant input elements, in the maximal case $m=v^{s-1}$, with $v=10$ and $s=5$, according to Eq. \ref{eq:pred_lcn_df}. The color map is in log scale. At fixed dimension $d$, a smaller $F$ (hence higher sparsity) makes the task easier.
...and 12 more figures
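
The generative process described in the caption of Figure 2 can be illustrated with a minimal Python sketch, using sparsity definition (A) and the parameters of that figure ($n_c=2$, $L=2$, $m=3$, $v=3$, $s=2$, $s_0=1$). This is not the authors' code: the function names `make_rules` and `expand`, the integer encoding of features, and the `EMPTY` marker for uninformative elements are assumptions made for illustration, and the way the production rules are drawn (distinct $s$-tuples sampled uniformly without replacement) is one plausible reading of the model.

```python
import itertools
import random

EMPTY = -1  # hypothetical marker for an uninformative element


def make_rules(n_parents, v, m, s, rng):
    """Give each parent symbol m distinct s-tuples of child features (its synonyms)."""
    tuples = list(itertools.product(range(v), repeat=s))
    assert n_parents * m <= len(tuples), "need n_parents * m <= v**s"
    chosen = rng.sample(tuples, n_parents * m)  # no tuple is shared between parents
    return {p: chosen[p * m:(p + 1) * m] for p in range(n_parents)}


def expand(symbol, level, rules, s, s0, rng):
    """Expand a level-`level` symbol into a flat list of input-level elements."""
    if level == 0:
        return [symbol]
    if symbol == EMPTY:
        # An uninformative element generates an empty patch of s*(s0+1) elements.
        children = [EMPTY] * (s * (s0 + 1))
    else:
        children = []
        # Pick one of the m synonyms, then embed each of its s features
        # at a random position of a sub-patch of size s0+1 (sparsity A).
        for feature in rng.choice(rules[level - 1][symbol]):
            sub_patch = [EMPTY] * (s0 + 1)
            sub_patch[rng.randrange(s0 + 1)] = feature
            children.extend(sub_patch)
    out = []
    for child in children:
        out.extend(expand(child, level - 1, rules, s, s0, rng))
    return out


rng = random.Random(0)
n_c, v, m, s, s0, L = 2, 3, 3, 2, 1, 2
# rules[k] maps level-(k+1) symbols to tuples of level-k features;
# the top level has n_c class symbols, the lower levels have v features.
rules = [make_rules(v, v, m, s, rng) for _ in range(L - 1)]
rules.append(make_rules(n_c, v, m, s, rng))

label = rng.randrange(n_c)
x = expand(label, L, rules, s, s0, rng)
print(label, x)
```

Under these assumptions each sample has $(s(s_0+1))^L = 16$ elements, of which only $s^L = 4$ are informative. Resampling the positions inside the sub-patches while keeping the features fixed leaves the label unchanged, which is the kind of discrete spatial transformation the abstract refers to.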
