Introduction to Machine Learning

Laurent Younes

TL;DR

This book offers a rigorous, math-centric foundation for machine learning, starting with a comprehensive treatment of notation across linear algebra, topology, calculus, and probability. It builds from core matrix analysis to optimization, covering unconstrained and constrained settings, stochastic optimization (including SGD and ADAM), and proximal methods, all within a convex-analysis framework. The text links these mathematical tools to ML algorithms and learning theory, providing the theoretical underpinnings for generalization, convergence, and algorithmic design. Its emphasis on measure-theoretic probability, spectral theory, and variational methods aims to equip readers with a deep, principled understanding of ML primitives and their limitations, informing robust application and research.

Abstract

This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and probability; it also introduces some measure-theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support for many algorithms that are used in the book, including stochastic gradient descent and proximal methods. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before describing various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, and neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapters describe the theory of graphical models and give an introduction to variational methods for models with latent variables and to deep-learning-based generative models. The next chapters focus on unsupervised learning methods for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.

Paper Structure (447 sections, 203 theorems, 2520 equations, 25 figures, 51 algorithms)

This paper contains 447 sections, 203 theorems, 2520 equations, 25 figures, 51 algorithms.

General Notation and Background Material
Linear algebra
Sets and functions
Vectors
Matrices
Multilinear maps
Topology
Open and closed sets in $\mathbb R^d$
Compact sets
Metric spaces
Calculus
Differentials
Important examples
Higher order derivatives
Taylor's theorem
...and 432 more sections

Key Result

theorem 1

Let $A, B\in {\mathcal{M}}_{n,d}({\mathbb{R}})$ have singular values $(\lambda_1, \ldots, \lambda_m)$ and $(\mu_1, \ldots, \mu_m)$, respectively, where $m = \min(n,d)$. Assume that these singular values are listed in decreasing order so that $\lambda_1\geq \cdots\geq \lambda_m$ and $\mu_1\geq \cdots\geq \mu_m$. Then ${\mathrm{trace}}(A^TB) \leq \sum_{i=1}^m \lambda_i\mu_i$. Moreover, if ${\mathrm{trace}}(A^TB) = \sum_{i=1}^m \lambda_i\mu_i$, then there exist $n\times n$ a...
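
The following is a minimal numerical sanity check, not a proof. It assumes the theorem is the classical von Neumann trace inequality, ${\mathrm{trace}}(A^TB) \leq \sum_{i=1}^m \lambda_i\mu_i$, which is what the equality case in the statement suggests. The helper name `von_neumann_gap`, the matrix sizes, and the use of NumPy are illustrative choices and do not come from the book.

```python
import numpy as np


def von_neumann_gap(A, B):
    """Return sum_i lambda_i * mu_i - trace(A^T B).

    Under the von Neumann trace inequality this gap is non-negative.
    """
    # compute_uv=False returns only the singular values, sorted in
    # decreasing order, with length min(n, d).
    lam = np.linalg.svd(A, compute_uv=False)
    mu = np.linalg.svd(B, compute_uv=False)
    return float(lam @ mu - np.trace(A.T @ B))


rng = np.random.default_rng(0)
n, d = 5, 3  # hypothetical sizes, so m = min(n, d) = 3

# Smallest gap observed over random Gaussian pairs (A, B).
gaps = [
    von_neumann_gap(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
    for _ in range(1000)
]
min_gap = min(gaps)

# Equality case: with B = A, trace(A^T A) equals the sum of squared
# singular values, so the gap vanishes up to rounding error.
A = rng.standard_normal((n, d))
equality_gap = von_neumann_gap(A, A)

print("smallest gap over random pairs:", min_gap)
print("gap when B = A:", equality_gap)
```

Taking $B = A$ is the simplest instance of the equality case, since both matrices then share the same singular vectors.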

Figures (25)

Figure 1: Kernel density estimators using a Gaussian kernel and various values of $\sigma$ when the true distribution of the data is a standard Gaussian (Orange: true density; Blue: estimated density, Red dots: training data).
Figure 2: Kernel density estimators using a Gaussian kernel and various values of $\sigma$ when the true distribution of the data is a Gamma distribution with parameter 2 (Orange: true density; Blue: estimated density, Red dots: training data).
Figure 3: Sources of errors in statistical learning: When $P^*$ is the distribution of the data, the optimal predictor $f^*$ minimizes the expected loss function. Based on data $Z_1, \ldots, Z_N$, the sample-based distribution is $\hat{P} = (\delta_{Z_1} + \cdots + \delta_{Z_N})/N$ and the empirical loss is minimized over a subset ${\mathcal{S}}$ of the space of all possible estimators. The expected discrepancy between the resulting estimator and the one minimizing the true expected loss on the subspace is the "variance" of the method, and the expected discrepancy between this subspace-constrained estimator and the optimal one is the "bias."
Figure 4: The function $V$ defining the SVM risk function.
Figure 5: Left: Original (training) data with three classes. Right: LDA scores, where the $x$ axis provides $\gamma_1$ and the $y$ axis $\gamma_2$.
...and 20 more figures
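
The kernel density estimators in Figures 1 and 2 can be sketched as follows. This is a minimal illustration assuming the standard Gaussian-kernel form $\hat f(x) = \frac{1}{N}\sum_{k=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-x_k)^2}{2\sigma^2}\right)$. The sample size, the values of $\sigma$, the evaluation grid, and the function name `gaussian_kde` are assumptions chosen for the example, not the settings used for the figures.

```python
import math
import random


def gaussian_kde(samples, sigma):
    """Return the function x -> (1/N) * sum_k N(x; x_k, sigma^2)."""
    n = len(samples)
    norm = 1.0 / (n * sigma * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / sigma) ** 2) for s in samples
        )

    return density


random.seed(0)
# Training data drawn from a standard Gaussian, as in the setting of Figure 1.
data = [random.gauss(0.0, 1.0) for _ in range(200)]

# Evaluation grid on [-6, 6] with step 0.01 (an arbitrary choice).
step = 0.01
grid = [-6.0 + step * i for i in range(1201)]

masses = {}
for sigma in (0.05, 0.3, 1.0):  # hypothetical bandwidths
    f_hat = gaussian_kde(data, sigma)
    # A Riemann sum approximates the total mass, which should be close to 1.
    masses[sigma] = sum(f_hat(x) for x in grid) * step
    print(f"sigma={sigma}: f_hat(0)={f_hat(0.0):.4f}, mass={masses[sigma]:.4f}")
```

A small $\sigma$ yields a spiky estimate concentrated around the training points, while a large $\sigma$ oversmooths. This trade-off is what varying $\sigma$ in the figures is meant to show.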

Theorems & Definitions (479)

remark 1
definition 1
theorem 1: Von Neumann
proof
remark 2
theorem 2
corollary 1
proof
theorem 3
proof
...and 469 more
