Learning in High Dimension Always Amounts to Extrapolation

Randall Balestriero, Jerome Pesenti, Yann LeCun

TL;DR

This paper challenges the view that interpolation underpins generalization in high dimensions by showing that new samples almost surely fall outside the training convex hull unless dataset sizes grow exponentially with the convex-hull dimension d*. It combines theoretical results on the convex-position problem with extensive experiments on synthetic data and real-world datasets (and their embeddings) to demonstrate that interpolation is exceedingly rare in high dimensions and that real-world performance largely occurs in extrapolation. The findings question the validity of using interpolation as a proxy for generalization and call for revised geometric definitions of interpolation/extrapolation tailored to high-dimensional settings. Together, the work highlights fundamental limits on data efficiency and has implications for how we evaluate and interpret high-dimensional learning systems.

Abstract

The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample $x$ whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when $x$ falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets; in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional ($>$100) dataset, interpolation almost surely never happens. Those results challenge the validity of our current interpolation/extrapolation definition as an indicator of generalization performance.
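The convex-hull definition of interpolation can be tested numerically: a sample $x$ lies in the hull of the dataset exactly when it can be written as a convex combination of the data points, which is a linear-programming feasibility problem. The following is a minimal sketch, assuming NumPy and SciPy are available; the function name `in_convex_hull` and the toy square dataset are illustrative and do not come from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def in_convex_hull(x, X):
    """Return True if x is a convex combination of the rows of X.

    Feasibility LP: find lam >= 0 with sum(lam) = 1 and X.T @ lam = x.
    """
    n = X.shape[0]
    # Equality constraints: X.T @ lam = x (one row per coordinate) and sum(lam) = 1.
    A_eq = np.vstack([X.T, np.ones((1, n))])
    b_eq = np.concatenate([x, [1.0]])
    # Zero objective: only feasibility matters, not the value of the optimum.
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.status == 0  # status 0: a feasible solution was found


# Toy dataset (an assumption for illustration): the corners of the unit square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
inside = in_convex_hull(np.array([0.5, 0.5]), square)    # interior point
boundary = in_convex_hull(np.array([1.0, 0.5]), square)  # point on an edge
outside = in_convex_hull(np.array([1.5, 0.5]), square)   # point beyond the square
```

Under the definition in the abstract, the interior and boundary points are in interpolation regime and the third point is in extrapolation regime.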

Paper Structure

This paper contains 6 sections, 4 theorems, 4 equations, 5 figures, 1 table.

Key Result

Theorem 1

Given a $d$-dimensional dataset $\boldsymbol{X}\triangleq \{\boldsymbol{x}_1,\dots,\boldsymbol{x}_N \}$ with i.i.d. samples uniformly drawn from a hyperball, the probability that a new sample $\boldsymbol{x}$ is in interpolation regime (recall Def. 1) has the following asymptotic behavior ...
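The probability in Theorem 1 can be estimated empirically. The sketch below draws a dataset and a new sample uniformly from the unit ball and counts how often the new sample lands in the dataset's convex hull. It assumes NumPy and SciPy; the helper names (`sample_ball`, `in_hull`, `interpolation_rate`) and the values $N=100$ and 100 trials are arbitrary choices made for speed, not settings from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def sample_ball(n, d, rng):
    """Draw n points uniformly from the unit ball in R^d."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform direction on the sphere
    r = rng.random((n, 1)) ** (1.0 / d)            # radius with density proportional to r^(d-1)
    return g * r


def in_hull(x, X):
    """LP feasibility test: is x a convex combination of the rows of X?"""
    n = X.shape[0]
    A_eq = np.vstack([X.T, np.ones((1, n))])
    b_eq = np.append(x, 1.0)
    return linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, method="highs").status == 0


def interpolation_rate(n, d, trials, seed=0):
    """Monte-Carlo estimate of p(x in Hull(X)) for N = n points in dimension d."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = sample_ball(n, d, rng)     # fresh dataset for each trial
        x = sample_ball(1, d, rng)[0]  # fresh new sample
        hits += in_hull(x, X)
    return hits / trials


# Fixed dataset size, increasing dimension.
rates = {d: interpolation_rate(n=100, d=d, trials=100) for d in (2, 5, 20)}
```

With the dataset size held fixed, the estimated rate should drop sharply as $d$ grows, which is the qualitative behavior the theorem describes.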

Figures (5)

  • Figure 1: Depiction of the evolution of the probability that a new sample is in interpolation regime (y-axis, $p(\boldsymbol{x} \in \text{Hull}(\boldsymbol{X}))$) given increasing dataset size (x-axis, $N$, in logarithmic scale) for various ambient space dimensions ($d$), based on Monte-Carlo estimates over $500,000$ trials. On the left, the data is sampled from a Gaussian density $\boldsymbol{x}_i\sim \mathcal{N}(0,I_d)$; in the middle, the data is sampled from a nonlinear continuous manifold with intrinsic dimension $1$ (see Fig. 2 for details on the manifold data); on the right, the data is sampled from a Gaussian density that lives in an affine subspace of constant dimension $4$ (while the ambient dimension increases). It is clear from those figures that, in order to maintain a constant probability of being in interpolation regime, the training set size has to increase exponentially with $d^*$ regardless of the underlying intrinsic manifold dimension, where $d^*$ is the dimension of the lowest-dimensional affine subspace including the entire data manifold, i.e., the convex hull dimension.
  • Figure 2: Depiction of the manifold data samples used for the middle plot of Fig. 1 with ${\rm dim}=5$ on the left and ${\rm dim}=3$ on the right. In all cases, the intrinsic dimension of this dataset is $1$; the latent coordinate ($z$) that governs the data ($\boldsymbol{x}(z)$) is depicted in the top row while the manifold samples in the ambient space are depicted in the bottom row. This manifold is continuous, nonlinear and piecewise smooth, and corresponds to walking around the simplex.
  • Figure 3: Depiction of the proportion of the test set that is in interpolation of the training set for MNIST (top), CIFAR (middle) and ImageNet (bottom) as a function of the number of selected dimensions. We propose two settings: (blue) selecting increasingly large central patches (some cases consist of irregular patches for intermediate dimension values) and (red) smoothing-subsampling the original images (some cases consist of irregular images for intermediate dimension values). Note that the blue line is always decreasing with $d$, and that $d=147$ (right of the x-axis) represents 19% of MNIST's total number of dimensions, 5% for CIFAR and less than 1% for ImageNet. As can be seen throughout those settings, the proportion of the test set that is in interpolation regime decreases exponentially fast with respect to the number of dimensions, ultimately becoming negligible well before reaching the full data dimensionality. The different slopes of those curves can be explained by the smallest-dimensional affine space containing each type of data (see Tab. 1).
  • Figure 4: Depiction of levels ($90\%$ to $99\%$ from light to dark) of explained variance from a Principal Component Analysis model for varying sub-image dimensions ($1$ to $147$, x-axis) based on the number of considered components (y-axis). The sub-images of dimension $d$ are obtained either by selecting the central spatial dimensions (blue, top row) or by smoothing and subsampling (red, bottom row) as per Fig. 3. From this, it is clear that for each sub-image dimension ($d$), the dimension of the smallest affine subspace containing the data decreases when going from MNIST to CIFAR10 to ImageNet, leading to the different slopes observed in Fig. 3 (recall Fig. 1).
  • Figure 5: Depiction of various nonlinear dimensionality reduction techniques applied to a synthetic dataset containing all the hypercube vertices for a hypercube of dimension $8$ (left) and $10$ (right). Coloring goes from blue for the vertex at position $(1,\dots,1)$ to green for the vertex at position $(-1,\dots,-1)$ in a linear manner. This data is chosen since each point/vertex is in extrapolation regime from all the other points/vertices. However, existing techniques for dimensionality reduction primarily focus on preserving local geometric information. As a result, regardless of the employed dimensionality reduction algorithm, the interpolation/extrapolation information is lost, as can be seen in all the proposed subplots. This can lead to hazardous assumptions and conclusions.
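The claim in the Figure 5 caption, that every hypercube vertex is in extrapolation regime with respect to all the other vertices, can be verified with a separating-hyperplane argument. For a vertex $v \in \{-1,1\}^d$, the linear functional $u \mapsto \langle u, v\rangle$ equals $d$ at $v$ but is at most $d-2$ at every other vertex, and therefore on their whole convex hull. The following is a minimal NumPy sketch of that check for $d=8$; the variable names are illustrative, and this is not the paper's code.

```python
import itertools
import numpy as np

d = 8
# All 2^d vertices of the hypercube {-1, 1}^d, one per row.
V = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

G = V @ V.T                     # G[i, j] = <v_i, v_j>
self_score = np.diag(G).copy()  # <v_i, v_i> = d for every vertex
np.fill_diagonal(G, -np.inf)    # exclude each vertex from its own comparison
best_other = G.max(axis=1)      # max over j != i of <v_j, v_i>

# If every other vertex scores strictly below v_i along direction v_i, the
# hyperplane {u : <u, v_i> = d - 1} separates v_i from the hull of the others.
separated = bool(np.all(best_other < self_score))
```

Because a linear functional attains its maximum over a convex hull at one of the generating points, comparing against the vertices alone is enough to certify that each vertex lies outside the hull of the rest.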

Theorems & Definitions (6)

  • Definition 1
  • Theorem 1: Bárány and Füredi (1988)
  • Definition 2: Convex position problem
  • Theorem 2: Valtr (1995, 1996)
  • Theorem 3: Buchta (1986)
  • Theorem 4: Kabluchko and Zaporozhets (2020)