Learnability of high-dimensional targets by two-parameter models and gradient flow

Dmitry Yarotsky

TL;DR

The main result shows that if the targets are described by a particular $d$-dimensional probability distribution, then there exist models with as few as two parameters that can learn the targets with arbitrarily high success probability.

Abstract

We explore the theoretical possibility of learning $d$-dimensional targets with $W$-parameter models by gradient flow (GF) when $W<d$. Our main result shows that if the targets are described by a particular $d$-dimensional probability distribution, then there exist models with as few as two parameters that can learn the targets with arbitrarily high success probability. On the other hand, we show that for $W<d$ there is necessarily a large subset of GF-non-learnable targets. In particular, the set of learnable targets is not dense in $\mathbb R^d$, and any subset of $\mathbb R^d$ homeomorphic to the $W$-dimensional sphere contains non-learnable targets. Finally, we observe that the model in our main theorem on almost guaranteed two-parameter learning is constructed using a hierarchical procedure and as a result is not expressible by a single elementary function. We show that this limitation is essential in the sense that most models written in terms of elementary functions cannot achieve the learnability demonstrated in this theorem.
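
For orientation, the setting described in the abstract can be written out as follows. This is a sketch under assumptions: the symbols $\Phi$, $L_{\mathbf f}$ and $\mathbf w(t)$ follow the notation visible in the section titles and figure captions of this summary, while the squared-distance form of the loss and its normalization are assumed here rather than quoted from the paper.

$$
\Phi:\mathbb R^W\to\mathbb R^d,\qquad
L_{\mathbf f}(\mathbf w)=\tfrac12\,\|\Phi(\mathbf w)-\mathbf f\|^2,\qquad
\frac{d\mathbf w(t)}{dt}=-\nabla_{\mathbf w}L_{\mathbf f}(\mathbf w(t)).
$$

In this reading, a target $\mathbf f\in\mathbb R^d$ counts as GF-learnable when $\inf_t L_{\mathbf f}(\mathbf w(t))=0$ along the gradient-flow trajectory, and the question is how large the set of such targets can be when $W<d$.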

Paper Structure (15 sections, 11 theorems, 17 equations, 4 figures)


Introduction
The setting
General impossibility results
Almost guaranteed learning with two parameters
Models expressible by elementary functions
Discussion
Main takeaways.
Open questions.
Proof of Theorem \ref{th:learn}
The initial map $\Phi^{(0)}$.
Ensuring convergence $\inf_t L_{\mathbf f}(\mathbf w(t))=0$ for all $\mathbf f\in F_0$.
Ensuring $\mu(F_0)\ge 1-\epsilon$.
Proof of Theorem \ref{th:main_pfaff}
Proof of Proposition \ref{prop:no_dense2}
Proof of Corollary \ref{corol:torus_lebesgue0}

Key Result

Theorem 1

There exists an activation function $\sigma$ which is real analytic, strictly increasing, sigmoidal (i.e., $\lim_{x\to -\infty}\sigma(x)=0$ and $\lim_{x\to +\infty}\sigma(x)=1$), and such that any $f\in C([0,1]^n)$ can be uniformly approximated with any accuracy by expressions $\sum_{i=1}^{6n+3}d_i\,\dots$

Figures (4)

Figure 1: Proof of Theorem \ref{th:borsuk_ulam}. GF cannot converge for all points of $G$: such a convergence would require $\Phi(\mathbf z_t)$ to be simultaneously close to both $g(\mathbf y_t)$ and $g(-\mathbf y_t)$, which are far from each other.
Figure 2: In Theorem \ref{th:learn}, we ensure that the learnable set of targets $F_\Phi$ contains a multidimensional "fat" Cantor set $F_0$ having almost full measure $\mu$. The set $F_0$ has the form $F_0=\cap_{n=1}^\infty \cup_\alpha B_\alpha^{(n)}$, where $\{B_\alpha^{(n)}\}_{n,\alpha}$ is a nested hierarchy of rectangular boxes in $\mathbb R^d$. Here, $n$ is the level of the hierarchy and $\alpha$ is the index of the box within the level.
Figure 3: The map $\Phi$ from Theorem \ref{th:learn} (see Section \ref{sec:proof_main} for details).
Figure 4: The curve $\Phi(w)=(\sin(w),\sin(\sqrt{2}w))$ densely fills the square $[-1,1]^2$, but for all targets $\mathbf f$ except for a set of Lebesgue measure 0 the respective GF trajectory $w(t)$ is trapped at a spurious local minimum, so that $\Phi(w(t))\not\to\mathbf f$. Corollary \ref{corol:torus_lebesgue0} shows that this is true for all models (\ref{eq:torus_flow}) with any number of parameters $W<d$.
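
The behaviour described in the Figure 4 caption can be explored numerically. The sketch below is an illustration, not the paper's code. It replaces gradient flow by plain gradient descent with a small step, assumes the loss $L_{\mathbf f}(w)=\tfrac12\|\Phi(w)-\mathbf f\|^2$, starts every run from the hypothetical initial point $w_0=0$, and uses made-up names (`phi`, `loss`, `grad`, `gradient_descent`, `trapped_fraction`) together with an arbitrary tolerance.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def phi(w):
    """Model from the Figure 4 caption: Phi(w) = (sin w, sin(sqrt(2) w))."""
    w = np.asarray(w, dtype=float)
    return np.stack([np.sin(w), np.sin(SQRT2 * w)], axis=-1)


def loss(w, targets):
    """Assumed loss L_f(w) = 0.5 * ||Phi(w) - f||^2, one value per target."""
    return 0.5 * np.sum((phi(w) - targets) ** 2, axis=-1)


def grad(w, targets):
    """Derivative of the assumed loss with respect to the scalar parameter w."""
    r = phi(w) - targets
    return r[..., 0] * np.cos(w) + SQRT2 * r[..., 1] * np.cos(SQRT2 * w)


def gradient_descent(targets, w0=0.0, lr=0.01, steps=5000):
    """Discretized stand-in for gradient flow, run for all targets at once."""
    w = np.full(len(targets), w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w, targets)
    return w


def trapped_fraction(n_targets=500, tol=1e-4, seed=0):
    """Fraction of uniformly sampled targets whose final loss exceeds tol."""
    rng = np.random.default_rng(seed)
    targets = rng.uniform(-1.0, 1.0, size=(n_targets, 2))
    w_final = gradient_descent(targets)
    return float(np.mean(loss(w_final, targets) > tol))


if __name__ == "__main__":
    print("fraction of targets left with residual loss:", trapped_fraction())
```

Under these assumptions the reported fraction should be large, in line with the caption's statement that only a measure-zero set of targets is actually reached. The finite tolerance and finite number of steps make this an approximation of the gradient-flow statement, not a verification of it.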

Theorems & Definitions (15)

Theorem 1: maiorov1999lower
Proposition 2
proof
Theorem 3
proof
Theorem 4
proof: Proof (see Figure \ref{fig:borsuk-ulam}).
Corollary 5
Theorem 6: \ref{sec:proof_main}
Theorem 7: Khovanskii, "elementary functions are Pfaffian"
...and 5 more
