Interplay between depth and width for interpolation in neural ODEs

Antonio Álvarez-López, Arselane Hadj Slimane, Enrique Zuazua

TL;DR

This work examines the interplay between the width $p$ and the number of layer transitions $L$ in neural ODEs. It also establishes an explicit error decay rate in $p$, obtained by applying a universal approximation theorem to a custom-built Lipschitz vector field that interpolates the dataset $D$.

Abstract

Neural ordinary differential equations (neural ODEs) have emerged as a natural tool for supervised learning from a control perspective, yet a complete understanding of their optimal architecture remains elusive. In this work, we examine the interplay between their width $p$ and number of layer transitions $L$ (effectively the depth $L+1$). Specifically, we assess the model expressivity in terms of its capacity to interpolate either a finite dataset $D$ comprising $N$ pairs of points or two probability measures in $\mathbb{R}^d$ within a Wasserstein error margin $\varepsilon>0$. Our findings reveal a balancing trade-off between $p$ and $L$, with $L$ scaling as $O(1+N/p)$ for dataset interpolation, and $L=O\left(1+(p\varepsilon^d)^{-1}\right)$ for measure interpolation. In the autonomous case, where $L=0$, a separate study is required, which we undertake focusing on dataset interpolation. We address the relaxed problem of $\varepsilon$-approximate controllability and establish an error decay of $\varepsilon\sim O(\log(p)p^{-1/d})$. This decay rate is a consequence of applying a universal approximation theorem to a custom-built Lipschitz vector field that interpolates $D$. In the high-dimensional setting, we further demonstrate that $p=O(N)$ neurons are likely sufficient to achieve exact control.
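The scaling laws quoted in the abstract can be tabulated with a short script. This is only an illustrative sketch: the function names are hypothetical, and every hidden constant in the $O(\cdot)$ bounds is set to 1, which the abstract does not claim.

```python
import math


def transitions_dataset(N: int, p: int) -> float:
    """Proxy 1 + N/p for the number of layer transitions L (dataset interpolation).

    Assumption: the hidden constant in O(1 + N/p) is taken to be 1.
    """
    return 1 + N / p


def transitions_measure(p: int, eps: float, d: int) -> float:
    """Proxy 1 + 1/(p * eps**d) for L (measure interpolation), constant set to 1."""
    return 1 + 1.0 / (p * eps**d)


def autonomous_error(p: int, d: int) -> float:
    """Proxy log(p) * p**(-1/d) for the error in the autonomous case L = 0."""
    return math.log(p) * p ** (-1.0 / d)


if __name__ == "__main__":
    # Doubling the width roughly halves the N/p term of the depth proxy.
    for p in (1, 10, 100):
        print(p, transitions_dataset(N=100, p=p))
```

Under these assumptions, increasing $p$ trades directly against $L$ for a fixed $N$. The proxy for the autonomous rate $\log(p)\,p^{-1/d}$ decays more slowly as $d$ grows.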

Paper Structure

This paper contains 18 sections, 13 theorems, 155 equations, and 10 figures.

Key Result

Theorem 1

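Let $N\geq1$, $d\geq2$ and $T>0$ be fixed. Consider the dataset $\mathcal{D}$ as defined in \ref{sample}, with pairs $(\mathbf{x}_n,\mathbf{y}_n)$, $n=1,\dots,N$. For any $p\geq1$, there exists a piecewise constant control $(W,A,\mathbf{b})$ such that the flow $\Phi_T(\cdot;W,A,\mathbf{b})$ generated by \ref{eq:node-p} interpolates the dataset $\mathcal{D}$, i.e., $$\Phi_T(\mathbf{x}_n;W,A,\mathbf{b})=\mathbf{y}_n\quad\text{for all } n=1,\dots,N.$$ Furthermore, the number of discontinuities of $(W,A,\mathbf{b})$ is $L=O(1+N/p)$.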
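To make the objects in Theorem 1 concrete, here is a minimal numerical sketch of a flow driven by a piecewise constant control. It assumes the vector field $W\boldsymbol{\sigma}(A\mathbf{x}+\mathbf{b})$ with ReLU activation described in the caption of Figure 1, with $W\in\mathbb{R}^{d\times p}$, $A\in\mathbb{R}^{p\times d}$ and $\mathbf{b}\in\mathbb{R}^p$. The function name `flow`, the forward Euler integrator and the equal-length time pieces are choices made for this illustration, not the construction used in the proof.

```python
import numpy as np


def relu(z):
    """Componentwise ReLU activation."""
    return np.maximum(z, 0.0)


def flow(x0, controls, T=1.0, steps_per_piece=100):
    """Approximate Phi_T(x0) for x' = W(t) relu(A(t) x + b(t)) by forward Euler.

    controls: list of (W, A, b) tuples, one per constant piece, so the
    control has len(controls) - 1 discontinuities. Splitting [0, T] into
    equal-length pieces is an assumption of this sketch.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = T / (len(controls) * steps_per_piece)
    for W, A, b in controls:
        for _ in range(steps_per_piece):
            x = x + dt * (W @ relu(A @ x + b))
    return x


# Toy example with d = 2, p = 1 and one discontinuity (L = 1):
# move along e_1 on the first piece, then along e_2 on the second.
A0 = np.zeros((1, 2))
b0 = np.ones(1)
controls = [
    (np.array([[1.0], [0.0]]), A0, b0),
    (np.array([[0.0], [1.0]]), A0, b0),
]
endpoint = flow([0.0, 0.0], controls, T=1.0)
```

In this toy example each piece produces a constant velocity, so the point starting at the origin moves by $T/2$ in each coordinate. Interpolating a full dataset would require choosing the controls so that every input reaches its target, which is what the theorem guarantees is possible.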

Figures (10)

  • Figure 1: Qualitative representation of models \ref{eq:node-shallow} and \ref{eq:node-Narrow} as discrete systems. Blue circles represent the input $\mathbf{x}$; switches depict ReLU functions; green circles indicate the result of $W\boldsymbol{\sigma}(A\mathbf{x}+\mathbf{b})$; orange circles represent the output after residual connections.
  • Figure 2: Left: separability condition in \ref{hyp1}, for $\mathbf{a}=\mathbf{e}_1$. Right: trajectories for exact control in the same example.
  • Figure 3: Construction of the Lipschitz field $\mathbf{V}$ in \ref{prop:exactsmooth}, which interpolates $\mathcal{D}$ in a compact domain $\Omega$ that contains all the points and curves.
  • Figure 4: Left to right: Compression, parallel motion, expansion.
  • Figure 5: Left: Step 1. Fix $x^{(1)}$ and control $x^{(2)},\dots,x^{(d)}$. Right: Step 2. Control $x^{(1)}$ while $x^{(2)},\dots,x^{(d)}$ are fixed.
  • ...and 5 more figures

Theorems & Definitions (29)

  • Theorem 1
  • Remark 1
  • Remark 2
  • Corollary 2
  • Corollary 3
  • Remark 3
  • Proposition 4
  • Corollary 5
  • Proposition 6
  • Theorem 7
  • ...and 19 more