Fundamental Components of Deep Learning: A category-theoretic approach

Bruno Gavranović

TL;DR

This thesis develops a novel mathematical foundation for deep learning based on the language of category theory. The framework is end-to-end, uniform, and not merely descriptive but prescriptive, meaning it is amenable to direct implementation in programming languages with sufficient features.

Abstract

Deep learning, despite its remarkable achievements, is still a young field. Like the early stages of many scientific disciplines, it is marked by the discovery of new phenomena, ad-hoc design decisions, and the lack of a uniform and compositional mathematical foundation. From the intricacies of the implementation of backpropagation, through a growing zoo of neural network architectures, to new and poorly understood phenomena such as double descent, scaling laws or in-context learning, there are few unifying principles in deep learning. This thesis develops a novel mathematical foundation for deep learning based on the language of category theory. We develop a new framework that is a) end-to-end, b) uniform, and c) not merely descriptive but prescriptive, meaning it is amenable to direct implementation in programming languages with sufficient features. We also systematise many existing approaches, placing many existing constructions and concepts from the literature under the same umbrella. In Part I we identify and model two main properties of deep learning systems: parametricity and bidirectionality. We expand on the previously defined construction of actegories and Para to study the former, and define weighted optics to study the latter. Combining them yields parametric weighted optics, a categorical model of artificial neural networks, and more. Part II justifies the abstractions from Part I, applying them to model backpropagation, architectures, and supervised learning. We provide a lens-theoretic axiomatisation of differentiation, covering not just smooth spaces, but discrete settings of boolean circuits as well. We survey existing categorical models of neural network architectures and develop new ones. We formalise the notion of optimisers and, lastly, combine all of these concepts, providing a uniform and compositional framework for supervised learning.

Paper Structure (75 sections, 78 theorems, 189 equations, 60 figures)

This paper contains 75 sections, 78 theorems, 189 equations, 60 figures.

Key Result

Proposition 11

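Let $(\cC, \otimes, I)$ be a monoidal category. Let $(f, f') : (A, A') \to (B, B')$ and $(g, g') : (B, B') \to (C, C')$ be morphisms in $\cC \times \cC$. The interchange law tells us that the following equation holds:

$$(g \otimes g') \circ (f \otimes f') = (g \circ f) \otimes (g' \circ f')$$

That is, we get the same result whether we first compose the morphisms in parallel and then in sequence, or first in sequence and then in parallel.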
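
The interchange law can be checked concretely in a familiar setting. The following is a minimal sketch, assuming we take ordinary Python functions as morphisms and pairing as the monoidal product; the helper names `seq` and `par` and the sample functions are illustrative assumptions, not notation from the thesis. It compares "parallel first, then sequence" against "sequence first, then parallel" on a few inputs.

```python
# Sketch only: functions as morphisms, tuples as the monoidal product.
# `seq` and `par` are hypothetical helper names chosen for this example.

def seq(f, g):
    """Sequential composition: apply f first, then g."""
    return lambda x: g(f(x))

def par(f, f2):
    """Parallel composition: apply f and f2 to the two halves of a pair."""
    return lambda pair: (f(pair[0]), f2(pair[1]))

# Two composable morphisms on the left wire (int -> int -> int) ...
f = lambda n: n + 1
g = lambda n: n * 2
# ... and two on the right wire (str -> str -> int).
f2 = str.upper
g2 = len

# Compose in parallel first, then in sequence.
lhs = seq(par(f, f2), par(g, g2))
# Compose in sequence first, then in parallel.
rhs = par(seq(f, g), seq(f2, g2))

samples = [(0, "a"), (3, "cat"), (-5, "")]
interchange_holds = all(lhs(s) == rhs(s) for s in samples)
print(interchange_holds)
```

Checking a handful of inputs is of course not a proof; it only illustrates what the equation asserts in one particular monoidal category.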

Figures (60)

  • Figure 1: The number of behaviours that can still be exhibited only by humans has steadily been shrinking. Graphic taken from kurzweil_age_1999.
  • Figure 2: Category Theory $\cap$ Machine Learning: cumulative number of papers through time. Data and figure taken from gavranovic_category_2020.
  • Figure 3: An informal illustration of gradient-based learning. This neural network is trained to distinguish different kinds of animals in the input image. Given an input $X$, the network predicts an output $Y$, which a loss function compares with the correct answer (the label). The loss function returns a real value expressing the error of the prediction. This information, together with the learning rate (a weight controlling how much the model should be changed in response to error), is used by an optimiser, which computes by gradient descent the update of the parameters of the network, with the aim of improving its accuracy. The neural network, the loss map, the optimiser and the learning rate are all components of a supervised learning system, and can vary independently of one another.
  • Figure 4: In traditional methods it is easy to start building a system. But as the system grows, it becomes difficult to understand how all the moving pieces inside interact. This slows down development, and introduces bugs and side effects. In structural methods, the situation is the opposite. It takes longer to start, but as special care is taken in accounting for all the moving pieces, it becomes easier to manage their complexity, and scale these systems up. (Figure taken from breiner_workshop_2018)
  • Figure 5: String diagram representation of the morphism \ref{eq:morphism_example_monoidal}. Objects of $\cC$ are denoted as wires, and morphisms as boxes. Notably, the unit $I$, the associator $\alpha$ and the unitor $\rho$ are completely invisible in the graphical representation.
  • ...and 55 more figures
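
The caption of Figure 3 describes a supervised learning system as four separately replaceable components: the network, the loss map, the optimiser and the learning rate. The following is a minimal sketch of that separation, assuming a one-parameter linear model, a squared-error loss and plain gradient descent; all function names and the toy data are hypothetical and are not taken from the thesis.

```python
# Sketch only: each component is a separate function and can be swapped out.

def model(p, x):
    """Toy 'network': a linear map with a single parameter p."""
    return p * x

def model_grad(p, x):
    """Derivative of the model output with respect to p."""
    return x

def loss(y_pred, y_true):
    """Squared error between the prediction and the label."""
    return (y_pred - y_true) ** 2

def loss_grad(y_pred, y_true):
    """Derivative of the loss with respect to the prediction."""
    return 2.0 * (y_pred - y_true)

def gradient_descent(learning_rate):
    """Optimiser: maps (parameter, gradient) to an updated parameter."""
    return lambda p, grad: p - learning_rate * grad

def train(p, data, optimiser, epochs):
    for _ in range(epochs):
        for x, y in data:
            y_pred = model(p, x)
            # Chain rule: d(loss)/dp = d(loss)/d(y_pred) * d(y_pred)/dp
            grad = loss_grad(y_pred, y) * model_grad(p, x)
            p = optimiser(p, grad)
    return p

# Toy data generated by y = 2x, so the ideal parameter is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
p_final = train(0.0, data, gradient_descent(0.05), epochs=100)
print(p_final)
```

Replacing `gradient_descent` with another update rule, or `loss` and `loss_grad` with another loss and its derivative, leaves `train` unchanged, which is the independence the caption points to.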

Theorems & Definitions (282)

  • Remark 1: Differences between category, set, and graph theory.
  • Definition 2: $\ref{def:set}$
  • Definition 3: $\ref{def:smooth}$, cockett_reverse_2020
  • Definition 4: $\ref{def:poly}_R$, cockett_reverse_2020
  • Definition 5: $\ref{def:fvect}_F$
  • Definition 6: $\ref{def:markov_kernel}$ (compare fritz_synthetic_2020 and hedges_value_2023)
  • Definition 7: Natural numbers
  • Definition 8: Edge cases
  • Definition 9: Product of categories
  • Definition 10: $\ref{def:cat}$
  • ...and 272 more