Learning Without Training

Ryan O'Dowd

Abstract

Machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set $\mathcal{D}=\{(x_j,f(x_j))\}_{j=1}^M$, can one build a model $F\approx f$? We introduce a method that aims to remedy several of the theoretical shortcomings of the current paradigm for supervised learning. The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. We introduce theory to unify signal separation with classification, and a new algorithm that achieves accuracy competitive with other recent active learning algorithms while producing results much faster.
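
To make the function approximation problem in the abstract concrete, the sketch below builds a model $F\approx f$ directly from samples $(x_j,f(x_j))$ with no optimization step. It uses the classical Nadaraya-Watson kernel estimator, which the dissertation names only as a comparison baseline; it is not the method the dissertation introduces. The target function, sample size, and bandwidth are assumptions chosen purely for illustration.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.1):
    """Gaussian-kernel weighted average of the samples; nothing is fitted."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Pairwise differences, shape (len(x_query), len(x_train)).
    diff = x_query[:, None] - x_train[None, :]
    weights = np.exp(-0.5 * (diff / bandwidth) ** 2)
    # Each query value is a convex combination of the observed values.
    return weights @ y_train / weights.sum(axis=1)

# Hypothetical target function and sampling, for illustration only.
def f(x):
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 200)   # sample points x_j
y_train = f(x_train)                   # observed values f(x_j)
x_query = np.linspace(0.1, 0.9, 50)    # interior evaluation points

F = nadaraya_watson(x_train, y_train, x_query, bandwidth=0.02)
max_err = np.max(np.abs(F - f(x_query)))
print(f"max abs error on interior points: {max_err:.4f}")
```

The estimate is a closed-form expression in the data, so there is no loss to minimize. The bandwidth plays the role of a resolution parameter: larger values smooth more and increase bias.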

Paper Structure (64 sections, 45 theorems, 380 equations, 28 figures, 6 tables, 2 algorithms)

Introduction
Machine Learning Background
Machine Learning Paradigm for Supervised Learning
Examples of Hypothesis Spaces
Empirical Risk, Generalization, and Optimization
Approximation Theory
Universal Approximation
Degree of Approximation
Shortcomings of the Supervised Learning Paradigm
Constructive Approximation
Manifold Learning
Shortcomings of Classical Approximation Theory
Organization of the Thesis
Approximation on Manifolds
Introduction
...and 49 more sections

Key Result

Theorem 1.5.1

Let the marginal distribution of the points $\{x_j\}$ be $\mu_d^*$. Let $\gamma>0$ and $f$ belong to a smoothness class with parameter $\gamma$; i.e., If $n\gtrsim 1$ and $M\gtrsim n^{d+2\gamma}\log(n)$, then with probability going to $1$ as $M\to \infty$, we have

Figures (28)

Figure 1: A depiction of the standard supervised learning paradigm. The universe of discourse $\mathcal{X}$ is assumed to contain a target function $f$ and hypothesis spaces $V_n$ are judiciously chosen based on the algorithm of choice. $P^\#$ denotes the empirical risk minimizer, $\tilde{P}$ denotes the minimizer of the generalization error, and $P^*$ denotes the best approximation.
Figure 2: Comparison of recoveries for $f(\theta)=\left|\cos\theta\right|^{1/4}$ by the best approximation (black) and a good approximation (red) for degrees $n=63,127,255$. Figure credit: Hrushikesh Mhaskar.
Figure 3: Depiction of nonlinear width. Here $\mathbb{X}$ is a metric space of functions with $\mathbb{K}$ some subset. The goal is to understand the error associated with approximating any function $f\in \mathbb{K}$, where $\mathcal{P}$ is a continuous parameterization of $\mathbb{K}$ and $\mathcal{A}$ is an approximation scheme mapping into $\mathbb{X}$.
Figure 4: A depiction of a new machine learning paradigm, where one constructs an approximation $\sigma_n$ in the space $V_n$ directly from the data. This is done in such a way that one can also measure a direct reconstruction error from the approximation to the target function.
Figure 5: Error comparison between our method, the Nadaraya-Watson estimator, and an interpolatory RBF network. (Left) Comparison of absolute errors between the methods with the target function plotted on the right $y$-axis for benefit of the viewer. We note that the error from the RBF method is scaled by $10^{-3}$ so as to not dominate the figure. (Right) Percent point plot of the log absolute error for all three methods.
...and 23 more figures

Theorems & Definitions (105)

Theorem 1.5.1: mhaskar-survey
Theorem 1.6.1: devorenonlinear
Theorem 2.1.1: mhaskar2020deep
Theorem 2.1.2
Example 2.2.1
Remark 2.4.1
Proposition 2.4.1
proof
Proposition 2.4.2: mhaskarsphere
Remark 2.4.2
...and 95 more
