The Physics of Data and Tasks: Theories of Locality and Compositionality in Deep Learning

Alessandro Favero

TL;DR

The thesis studies how neural networks learn high-dimensional tasks by exploiting latent structure in data, focusing on locality and compositionality. First, it develops an analytical framework in the infinite-width limit showing that locality lets networks beat the curse of dimensionality: the learning error decays as ${\mathcal E}(P) \sim P^{-\beta}$ in the number of training examples $P$, with an exponent $\beta$ that depends on the local structure rather than on the ambient dimension. Second, it takes a hierarchical generative perspective, using diffusion models and a Random Hierarchy Model to reveal phase transitions and polynomial-sample learning when data are composed hierarchically, demonstrating a compositional grammar of data. Finally, it uncovers a form of compositionality in model weight space, called weight disentanglement, in which task vectors correspond to localized function changes, enabling task arithmetic and modular model editing. Together, these results provide a physics-inspired, multi-scale theory of data and tasks that connects data locality, hierarchical generation, and weight-space modularity to explain generalization, creativity, and editability in deep learning.
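As a rough illustration of how an exponent $\beta$ in ${\mathcal E}(P) \sim P^{-\beta}$ can be read off a measured learning curve, the sketch below fits a straight line to $\log {\mathcal E}$ versus $\log P$ by ordinary least squares. It is not the thesis's procedure: the function name `fit_power_law_exponent` is hypothetical, the data are synthetic, and the fit assumes all points lie in the asymptotic power-law regime.

```python
import math


def fit_power_law_exponent(train_sizes, errors):
    """Fit errors ~ prefactor * P**(-beta) by least squares in log-log space.

    Returns (beta, prefactor). Assumes every point already lies in the
    asymptotic power-law regime (an assumption of this sketch).
    """
    if len(train_sizes) != len(errors) or len(train_sizes) < 2:
        raise ValueError("need at least two (P, error) pairs of equal length")
    xs = [math.log(p) for p in train_sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(error) vs log(P)
    intercept = y_mean - slope * x_mean    # log of the prefactor
    return -slope, math.exp(intercept)


# Synthetic learning curve with a known exponent (illustrative values only).
sizes = [2 ** k for k in range(7, 14)]
synthetic_errors = [3.0 * p ** -0.5 for p in sizes]
beta, prefactor = fit_power_law_exponent(sizes, synthetic_errors)
print(f"estimated beta = {beta:.3f}, prefactor = {prefactor:.3f}")
```

On real learning curves, small-$P$ points often deviate from the asymptotic behavior, so one would typically restrict the fit to the largest training-set sizes.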

Abstract

Deep neural networks have achieved remarkable success, yet our understanding of how they learn remains limited. These models can learn high-dimensional tasks, which is generally statistically intractable due to the curse of dimensionality. This apparent paradox suggests that learnable data must have an underlying latent structure. What is the nature of this structure? How do neural networks encode and exploit it, and how does it quantitatively impact performance (for instance, how does generalization improve with the number of training examples)? This thesis addresses these questions by studying the roles of locality and compositionality in data, tasks, and deep learning representations.

Paper Structure

This paper contains 324 sections, 27 theorems, 401 equations, 87 figures, 3 tables.

Key Result

Lemma 2.3.1

Denoting as ${\mathcal{K}}_{\mathrm{NTK}}^{FC}$ the NTK of a fully-connected network function acting on $s$-dimensional inputs and ${\mathcal{K}}_{\mathrm{NTK}}^{CN}$ the NTK of a convolutional network function (eq:loc-cnn) with filter size $s$ acting on $d$-dimensional inputs,

Figures (87)

  • Figure 1: Learning curves for different combinations of convolutional teachers with convolutional (left panels) and local (right panels) students. The teacher and student filter sizes are denoted by $t$ and $s$, respectively. Data are sampled uniformly in the hypercube $[0,1]^d$, with $d=9$ unless otherwise specified. Solid lines are the results of numerical experiments averaged over 128 realizations, and the shaded areas represent the empirical standard deviations. The predicted scaling is shown by dashed lines. All the panels are discussed in \ref{sec:loc-empirical}, while further details and additional experiments are reported in \ref{app:loc-numerics}.
  • Figure 2: Left: Computational skeleton of a convolutional neural network of depth $L+1\,{=}\,4$ ($L\,{=}\,3$ hidden layers). The leaves of the graph (squares) correspond to input coordinates, and the root (empty circle) to the output. All other nodes represent (infinitely wide layers of) hidden neurons. We define as 'meta-patches' (i.e., patches of patches) the sets of input variables that share a common ancestor node along the tree (such as the squares within each colored rectangle). Each meta-patch coincides with the receptive field of the neuron represented by this common ancestor node, as indicated below the input coordinates. For each hidden layer $l\,{=}\,1,\dots,L$, there is a family of meta-patches having dimensionality $d_{\text{eff}}(l)$. Right: Sketches of learning curves ${\mathcal{E}}(P)$ obtained by learning target functions of varying spatial scale with the network on the left. More specifically, the target is a function of a $3$-dimensional patch for the blue curve, a $6$-dimensional patch for the orange curve, and the full input for the green curve. We predict (and confirm empirically) that both the decay of ${\mathcal{E}}$ with $P$ (full lines) and the rigorous upper bound (dashed lines) are controlled by the effective dimensionality of the target.
  • Figure 3: Learning curves for deep convolutional NTKs in a teacher-student setting. (a) Depth-four student learning depth-two, depth-three, and depth-four teachers. (b) Depth-three models cursed by the effective input dimensionality $d_{\mathrm{eff}}(L)$. The numbers inside brackets are the sequence of filter sizes of the kernels. Solid lines are the results of experiments averaged over 16 realizations, with the shaded areas representing the empirical standard deviations. The predicted asymptotic scalings ${\mathcal{E}} \sim P^{-\beta}$ are reported as dashed lines. Details on the numerical experiments are reported in \ref{app:deep-numerics}.
  • Figure 4: Illustration of forward-backward experiments. Images generated by a denoising diffusion probabilistic model starting from the top-left image and inverting the dynamics at different times $t$. $T$ corresponds to the time scale when the forward diffusion process converges to an isotropic Gaussian distribution. At small $t$, the class of the generated image remains unchanged, with only alterations of low-level features, such as the eyes of the leopard. After a characteristic time $t$, the class undergoes a phase transition and changes. However, some low-level attributes of the original image are retained to compose the new image. For instance, the wolf is composed of eyes, nose, and ears similar to those of the leopard, and the butterfly inherits its colors and black spots.
  • Figure 5: Left: Examples of images generated by reverting the diffusion process at different times $t$. Starting from the left images ${\mathbf{x}}_0$ at time $t=0$, we generate samples $\hat{{\mathbf{x}}}_{0}(t)\sim p_\theta(\hat{{\mathbf{x}}}_{0}|{\mathbf{x}}_t)$ by first running the diffusion process up to time $t$ and then reverting it, as described in \ref{sec:forward-backward-exp}. At time $t=T$, ${\mathbf{x}}_T$ corresponds to isotropic Gaussian noise and the generated image $\hat{{\mathbf{x}}}_{0}(T)$ is uncorrelated with ${\mathbf{x}}_0$. At intermediate times, instead, a sudden change of the image class is observed, while some lower-level features are retained. Right: Cosine similarity between the post-activations of the hidden layers of a ConvNeXt Base [liu2022convnet] for the initial images ${\mathbf{x}}_0$ and the synthesized ones $\hat{{\mathbf{x}}}_{0}(t)$. Around $t \approx T/2$, the similarity between logits exhibits a sharp drop, indicating the change in class, while the hidden representations of the first layers change more smoothly. This indicates that certain low-level features from the original images are retained for composing the sampled images even after the class transition. To compute the cosine similarity, all activations are standardized, i.e., centered around the mean and scaled by the standard deviation computed on the $50000$ images of the ImageNet-1k validation set. At each time, the values of the cosine similarity correspond to the maximum of their empirical distribution over $10000$ images ($10$ per class of ImageNet-1k).
  • ...and 82 more figures
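
The Figure 5 caption describes comparing activations by standardizing them with statistics from a reference set and then taking a cosine similarity. A minimal sketch of that computation follows. It assumes per-feature means and standard deviations and uses tiny made-up vectors in place of real network activations; the helper names `feature_stats`, `standardize`, and `cosine_similarity` are hypothetical and not taken from the thesis.

```python
import math


def feature_stats(reference_activations):
    """Per-feature mean and (population) standard deviation over a reference set."""
    n = len(reference_activations)
    dim = len(reference_activations[0])
    means = [sum(row[j] for row in reference_activations) / n for j in range(dim)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in reference_activations) / n)
        for j in range(dim)
    ]
    return means, stds


def standardize(activation, means, stds, eps=1e-12):
    """Center each feature and scale it by its reference standard deviation."""
    return [(a - m) / (s + eps) for a, m, s in zip(activation, means, stds)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy stand-ins for activations of one layer (illustrative values only).
reference = [[1.0, 10.0], [3.0, 30.0]]      # plays the role of the validation set
means, stds = feature_stats(reference)

act_original = standardize([3.0, 30.0], means, stds)
act_generated = standardize([3.0, 10.0], means, stds)

sim_same = cosine_similarity(act_original, act_original)
sim_diff = cosine_similarity(act_original, act_generated)
print(f"self-similarity = {sim_same:.3f}, original vs generated = {sim_diff:.3f}")
```

Standardizing first keeps features with large raw scales from dominating the similarity, which is why the second feature here contributes as much as the first despite being ten times larger.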

Theorems & Definitions (48)

  • Definition 2.3.1: one-hidden-layer CNN
  • Definition 2.3.2: one-hidden-layer LCN
  • Definition 2.3.3: Neural Tangent Kernel
  • Lemma 2.3.1
  • Lemma 2.3.2
  • Lemma 2.3.3: Spectra of convolutional kernels
  • Lemma 2.3.4: Spectra of local kernels
  • Theorem 2.4.1
  • Theorem 2.6.1
  • Corollary 2.6.1.1
  • ...and 38 more