Least Squares as Random Walks

Alexander Kostinski, Glenn Ierley, Sarah Kostinski

Abstract

Linear least squares (LLS) is perhaps the most common method of data analysis, dating back to Legendre, Gauss and Laplace. Framed as linear regression, LLS is also a backbone of mathematical statistics. Here we report on an unexpected new connection between LLS and random walks. To that end, we introduce the notion of a random walk based on a discrete sequence of data samples (data walk). We show that the slope of a straight line which annuls the net area under a residual data walk equals the one found by LLS. For equidistant data samples this result is exact and holds for an arbitrary distribution of steps.
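A short sketch of why the two slopes coincide for equidistant samples may help. It assumes $x_k = k$ and takes the data walk to be the cumulative sum of mean-subtracted residuals; the symbols $s_k$, $W_m$, $A$, $S_{xy}$ and $S_{xx}$ are introduced here for illustration, and the paper's own notation and normalization may differ.

$$
\begin{aligned}
r_k &= y_k - \alpha k, \qquad s_k = r_k - \bar{r}, \qquad W_m = \sum_{k=1}^{m} s_k, \qquad W_N = 0, \\
A(\alpha) &= \sum_{m=1}^{N} W_m = \sum_{k=1}^{N} (N+1-k)\, s_k = -\sum_{k=1}^{N} (k - \bar{k})(r_k - \bar{r}) = -S_{xy} + \alpha\, S_{xx}, \\
S_{xy} &= \sum_{k=1}^{N} (k - \bar{k})(y_k - \bar{y}), \qquad S_{xx} = \sum_{k=1}^{N} (k - \bar{k})^2 .
\end{aligned}
$$

The third equality in the second line uses $\sum_k s_k = 0$. Setting $A(\alpha) = 0$ gives $\alpha = S_{xy}/S_{xx}$, which is the familiar LLS slope, and no assumption about the distribution of the steps enters.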

Paper Structure

This paper contains 2 sections, 27 equations, 2 figures.

Figures (2)

  • Figure 1: Linear function with additive Gaussian noise. Panel (a): Twenty-four data samples generated via $y_k = x_k + n_k$ vs. $x_k$ for $k=1,\ldots, 24$. The $n_k$ are independent random samples drawn from a zero-mean, unit-variance Gaussian probability distribution denoted by ${\cal N}(0,1)$. The solid black line is a linear least squares (LLS) fit to this noisy data. The LLS slope estimate of $1.2$ differs from unity because of sampling variability. Panel (b): The same data samples as in (a), shifted vertically because the arithmetic mean is subtracted from each sample. The upper abscissa gives the original values while the lower one represents the transition to consecutive integers, subsequently interpreted as steps of a data walk (see text).
  • Figure 2: Construction of a pinned data walk (DW). Panel (a): Data string $y_k - \bar{y}$ from panel (b) of Fig. 1, vs. step number $k=1, \ldots, 24$. The ordinates serve as increments (steps) to construct the DW via Eq. (\ref{eq:PRW}). There is an upward trend in the data sequence. Panel (b): The resulting DW has one interior zero, and an area under the curve of $58.3$ via Eq. (\ref{eq:trend}), indicating a large positive trend. The unit-slope trend, noise-free reference parabola (area of 50) given by Eq. (\ref{eq:A5}) is illustrated with crosses. Panel (c): Upon subtraction of the LLS fit from the data in panel (a), the residuals $r_k = y_k - \alpha x_k$, shifted by their sample mean $\bar{r}$, yield the DW bridge in panel (d). Panel (d): The DW has five interior zeros, causing a perfect cancellation of signed areas and zero trend via Eq. (\ref{eq:trend}). For a symmetric random walk (not necessarily a bridge), the mean number of zero-crossings is $\sim\!\sqrt{N}$ (here $5 \approx \sqrt{24}$) while the most likely number of zero-crossings is zero \cite{papoulis2002random}. The conclusion is that the LLS slope annuls the area under the residual DW and vice versa, as proven in the main text. This theorem holds for an arbitrary (e.g., asymmetric) distribution of DW steps.
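
The construction described in the two captions can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the data walk is the cumulative sum of mean-subtracted samples and uses the plain (unnormalized) sum of walk values as the signed area, so its sign and scale need not match the paper's trend equation. The function names `walk_area` and `area_annulling_slope`, the random seed, and the use of NumPy are all choices made here for illustration.

```python
import numpy as np


def walk_area(steps):
    """Signed area under the pinned data walk built from mean-subtracted steps.

    Assumption: the walk is the cumulative sum of (step - mean), so it
    returns to zero at the last step; the area is the plain sum of the
    walk values (no normalization).
    """
    steps = np.asarray(steps, dtype=float)
    walk = np.cumsum(steps - steps.mean())
    return walk.sum()


def area_annulling_slope(x, y):
    """Slope alpha for which the residual walk of y - alpha * x has zero area.

    The area is linear in alpha, A(alpha) = A_y - alpha * A_x, so the root
    is a simple ratio. Intended for equidistant x only.
    """
    return walk_area(y) / walk_area(x)


# Synthetic data in the spirit of Figure 1: unit slope plus N(0, 1) noise.
rng = np.random.default_rng(0)  # arbitrary seed, not from the source
N = 24
x = np.arange(1, N + 1, dtype=float)
y = x + rng.standard_normal(N)

alpha_area = area_annulling_slope(x, y)
alpha_lls = np.polyfit(x, y, 1)[0]  # ordinary least-squares slope

residual_area = walk_area(y - alpha_area * x)  # should vanish
reference_area = walk_area(x)  # noise-free, unit-slope walk

print(f"area-annulling slope: {alpha_area:.6f}")
print(f"LLS slope:            {alpha_lls:.6f}")
print(f"residual walk area:   {residual_area:.2e}")
```

With these conventions the noise-free, unit-slope walk for $N = 24$ has an unnormalized area of $-1150$. Its magnitude divided by $N - 1$ is 50, which agrees with the reference value quoted in the Figure 2 caption, but that sign and normalization are a guess rather than something taken from the source.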