The Good, The Efficient and the Inductive Biases: Exploring Efficiency in Deep Learning Through the Use of Inductive Biases

David W. Romero

TL;DR

This dissertation examines how inductive biases, particularly continuous modeling and symmetry preservation, can address challenges related to the efficiency and sustainability of Deep Learning. It also identifies promising future research avenues in the exploration of inductive biases for efficiency, and their wider implications for Deep Learning.

Abstract

The emergence of Deep Learning has marked a profound shift in machine learning, driven by numerous breakthroughs achieved in recent years. However, as Deep Learning becomes increasingly present in everyday tools and applications, there is a growing need to address unresolved challenges related to its efficiency and sustainability. This dissertation delves into the role of inductive biases -- particularly, continuous modeling and symmetry preservation -- as strategies to enhance the efficiency of Deep Learning. It is structured in two main parts. The first part investigates continuous modeling as a tool to improve the efficiency of Deep Learning algorithms. Continuous modeling parameterizes neural operations in a continuous space. The research presented in this part demonstrates substantial benefits for (i) computational efficiency -- in time and memory, (ii) parameter efficiency, and (iii) design efficiency -- the complexity of designing neural architectures for new datasets and tasks. The second part focuses on the role of symmetry preservation in Deep Learning efficiency. Symmetry preservation involves designing neural operations that align with the inherent symmetries of data. The research presented in this part highlights significant gains in both data and parameter efficiency through the use of symmetry preservation. However, it also acknowledges a resulting trade-off of increased computational costs. The dissertation concludes with a critical evaluation of these findings, openly discussing their limitations and proposing strategies to address them, informed by the literature and the author's insights. It ends by identifying promising future research avenues in the exploration of inductive biases for efficiency, and their wider implications for Deep Learning.

Paper Structure

This paper contains 299 sections, 8 theorems, 202 equations, 92 figures, 60 tables, 1 algorithm.

Key Result

Theorem 8.3.1

The attentive group convolution is an equivariant operator if and only if the attention operator $\mathcal{A}$ satisfies: If, moreover, the maps generated by $\mathcal{A}$ are invariant to one of its arguments, and, hence, exclusively attend to either the input or the output domain (Sec. sec:6_att_sequence), then $\mathcal{A}$ satisfies Eq. eq:6_equivconstraint if and only if it is equivariant an

Figures (92)

  • Figure 1: Convolutional vs fully connected layers. Convolutional layers share the same weights across all positions. Hence, the same mapping is applied (and learned) at all positions, here depicted by weights of the same color in \ref{fig:0_conv}. In contrast, fully connected layers apply (and learn) a unique set of weights to each position (\ref{fig:0_fully}). This increases the number of weights needed by $O(\mathrm{N^2})$ for an $\mathrm{N}{\times}{\mathrm{N}}$ image, and hampers generalization across positions. Since all features must be independently learned at each position, fully connected layers need $O(\mathrm{N^2})$ more data samples than convolutions.
  • Figure 2: Kernels learned by AlexNet \cite{krizhevsky2012imagenet}. Despite their discrete parameterization, the convolutional kernels learn discrete approximations of continuous functions. Representing the kernels in a continuous domain facilitates the learning of such kernels.
  • Figure 3: Continuous Kernel Convolution (CKConv). CKConv views a convolutional kernel as a vector-valued continuous function $\boldsymbol{\psi}: {\mathbb{R}} \rightarrow {\mathbb{R}}^{\mathrm{N_{out}}\times \mathrm{N_{in}}}$ parameterized by a small neural network MLP$^{{\boldsymbol{\psi}}}$. MLP$^{{\boldsymbol{\psi}}}$ receives a time-step and outputs the value of the convolutional kernel at that position. We sample convolutional kernels by passing a set of relative positions $\{ \Delta\tau_i\}$ to MLP$^{{\boldsymbol{\psi}}}$, and then convolve the input with the sampled kernel. Since MLP$^{{\boldsymbol{\psi}}}$ is a continuous function, CKConvs can (i) construct arbitrarily large kernels, (ii) generate kernels at different resolutions, and (iii) handle irregular data.
  • Figure 4: Discrete centered, causal, and dilated causal convolutions.
  • Figure 5: Functional family of recurrent units, discrete convolutions and CKConvs. For max. eigenvalues of ${\mathbf{{W}}}$, $\lambda{\neq}1$, recurrent units are restricted to exponentially decreasing ($\lambda {<} 1$) or increasing ($\lambda {>} 1$) functions (Figs. \ref{fig:2_vanishing_rnn}, \ref{fig:2_exploding_rnn}). Discrete convolutions can describe arbitrary functions within their memory horizon but are zero otherwise (Fig. \ref{fig:2_exploding_conv}). Conversely, CKConvs define arbitrarily long memory horizons, and thus are able to describe arbitrary functions over the entire input sequence (Fig. \ref{fig:2_exploding_ckconv}).
  • ...and 87 more figures
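
The CKConv idea described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the dissertation's implementation: it uses a single input and output channel instead of a matrix-valued kernel, a tiny MLP with fixed random weights and a sine nonlinearity, and relative positions normalized to [0, 1]. The helper names `make_kernel_mlp`, `sample_kernel` and `causal_conv` are hypothetical and chosen only for this example.

```python
import math
import random


def make_kernel_mlp(hidden=16, seed=0):
    """Return a tiny MLP psi: R -> R with fixed random weights (illustration only)."""
    rng = random.Random(seed)
    w1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
    w2 = [rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]

    def psi(delta_tau):
        # One hidden layer; the sine nonlinearity is an assumption of this sketch.
        return sum(w2[j] * math.sin(w1[j] * delta_tau + b1[j]) for j in range(hidden))

    return psi


def sample_kernel(psi, length):
    """Sample the continuous kernel at `length` relative positions in [0, 1]."""
    if length == 1:
        return [psi(0.0)]
    return [psi(i / (length - 1)) for i in range(length)]


def causal_conv(signal, kernel):
    """Causal convolution: y[t] = sum_k kernel[k] * signal[t - k], zero-padded on the left."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k in range(min(len(kernel), t + 1)):
            acc += kernel[k] * signal[t - k]
        out.append(acc)
    return out


psi = make_kernel_mlp()
signal = [math.sin(0.3 * t) for t in range(32)]

# (i) The kernel can be as long as the input without adding parameters.
kernel_full = sample_kernel(psi, len(signal))
# (ii) The same function can be sampled on a finer grid (a different resolution).
kernel_fine = sample_kernel(psi, 2 * len(signal) - 1)

output = causal_conv(signal, kernel_full)
```

In this sketch the number of parameters (three lists of length `hidden`) does not depend on the kernel length, and every second sample of `kernel_fine` coincides with `kernel_full` because both are evaluations of the same continuous function.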

Theorems & Definitions (24)

  • Theorem 8.3.1
  • Definition 9.4.0.1: Group equivariance
  • Example 9.4.0.1: Permutation equivariance
  • Definition 9.4.0.2: Equivariance of set functions to groups acting on homogeneous spaces
  • Example 9.4.0.2: Translation equivariance
  • Proposition 9.4.1
  • Definition 9.4.0.3: Unique ${\mathcal{G}}$-equivariance
  • Proposition 9.4.2
  • Proposition 9.4.3
  • Proposition 9.5.1
  • ...and 14 more
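
The definitions themselves are not reproduced in this listing. As a rough illustration of the standard notion of equivariance, $f(g \cdot x) = g \cdot f(x)$, the sketch below checks numerically that a circular convolution commutes with cyclic shifts. The use of circular (periodic) boundaries and the function names `circular_conv` and `shift` are assumptions made for this example, not taken from the dissertation.

```python
def circular_conv(signal, kernel):
    """Circular convolution: y[t] = sum_k kernel[k] * signal[(t - k) mod n]."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(t - k) % n] for k in range(len(kernel)))
        for t in range(n)
    ]


def shift(signal, s):
    """Cyclically translate a signal by s positions."""
    n = len(signal)
    return [signal[(t - s) % n] for t in range(n)]


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
kernel = [0.5, -1.0, 0.25]

# Translating the input and then convolving ...
shift_then_conv = circular_conv(shift(signal, 2), kernel)
# ... matches convolving first and then translating the output.
conv_then_shift = shift(circular_conv(signal, kernel), 2)

is_equivariant = all(
    abs(a - b) < 1e-12 for a, b in zip(shift_then_conv, conv_then_shift)
)
```

Here `is_equivariant` evaluates to `True` for any shift amount, since shifting the input only re-indexes the terms of each sum.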