Table of Contents
Fetching ...

Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

Shuofeng Zhang, Ard Louis

TL;DR

This work provides, for overparameterized linear regression with minimum-ℓ_p interpolation (p∈(1,2]), a closed-form, high-probability characterization of how the entire ℓ_r norm family (r∈[1,p]) scales with sample size. A simple dual-ray analysis reveals a competition between a signal spike and a bulk of null coordinates in X^⊤Y, yielding a data-dependent elbow n⋆ and a universal threshold r⋆=2(p−1) that splits norms into plateauing versus growing regimes with explicit exponents. The theory extends to diagonal linear networks by mapping initialization α to an effective p_eff(α), and experiments show DLNs inherit the same elbow/threshold structure, bridging explicit and implicit bias. These results imply that norm-based generalization diagnostics can be highly sensitive to the chosen r and underlying p-bias, providing practical guidance for selecting r and interpreting norm proxies in high-dimensional interpolation tasks.

Abstract

For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \\{ \lVert \widehat{w_p} \rVert_r \\}_{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$'s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $α$ to an effective $p_{\mathrm{eff}}(α)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.

Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

TL;DR

This work provides, for overparameterized linear regression with minimum-ℓ_p interpolation (p∈(1,2]), a closed-form, high-probability characterization of how the entire ℓ_r norm family (r∈[1,p]) scales with sample size. A simple dual-ray analysis reveals a competition between a signal spike and a bulk of null coordinates in X^⊤Y, yielding a data-dependent elbow n⋆ and a universal threshold r⋆=2(p−1) that splits norms into plateauing versus growing regimes with explicit exponents. The theory extends to diagonal linear networks by mapping initialization α to an effective p_eff(α), and experiments show DLNs inherit the same elbow/threshold structure, bridging explicit and implicit bias. These results imply that norm-based generalization diagnostics can be highly sensitive to the chosen r and underlying p-bias, providing practical guidance for selecting r and interpreting norm proxies in high-dimensional interpolation tasks.

Abstract

For overparameterized linear regression with isotropic Gaussian design and minimum- interpolator , we give a unified, high-probability characterization for the scaling of the family of parameter norms with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal *spike* and a *bulk* of null coordinates in , yielding closed-form predictions for (i) a data-dependent transition (the "elbow"), and (ii) a universal threshold that separates 's which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of *all* norms within the family under -biased interpolation, and explains in one picture which norms saturate and which increase as grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale to an effective via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on , our results suggest that their predictive power will depend sensitively on which norm is used.

Paper Structure

This paper contains 55 sections, 10 theorems, 227 equations, 18 figures, 2 algorithms.

Key Result

Theorem 3.1

Fix $p\in(1,2]$, set $q=\frac{p}{p-1}$, and take $r\in[1,p]$. Assume Let $w^\star$ have support $S$ with $|S|=s$, and let Write $W_q:=\|w^\star\|_q^q$ and $m_t:=\mathbb{E}|Z|^t$ for $Z\sim\mathcal{N}(0,1)$. Define the ray scale$t_\star$ via Then, w.h.p. (see Remark rem:growth-prob), Introduce the transition scale In the two extremes, we obtain: Spike-dominated ($n\gg n_\star$): Bulk-dominated

Figures (18)

  • Figure 1: Single spike $w^\star=e_1$; explicit minimum‑$\ell_p$ interpolation. Ordering across $r$ and the presence/absence of elbows follow Corollary \ref{['cor:e1']}; the bulk panels rise like $n^{1/2}$ and the spike‑side panels plateau for $r>2(p{-}1)$, consistent with equation \ref{['eq:SD']}-equation \ref{['eq:BD']}.
  • Figure 2: Flat $w^\star$ ($s=50$); explicit minimum‑$\ell_p$ interpolation. The scaling rules mirror the flat‑support corollary: bulk growth persists until a larger transition scale, while spike‑side $r$ values plateau; absolute levels are comparable to the single‑spike case, as predicted by the plateau formulas.
  • Figure 3: Single spike $w^\star=e_1$; diagonal linear network (DLN). After calibrating $\alpha$ to $p_{\mathrm{eff}}(\alpha)$, the regime structure matches the explicit $p$ case: smaller $\alpha$ exhibits earlier spike dominance and plateaus for $r>2(p{-}1)$; larger $\alpha$ remains bulk‑dominated with $n^{1/2}$‑like growth.
  • Figure 4: Flat $w^\star$ ($s=50$); diagonal linear network (DLN). The same scaling rules hold, but the elbow appears at larger $n$---in line with the flat‑support transition scale---while absolute $\ell_r$ magnitudes remain comparable to the single‑spike case.
  • Figure S1: Slope-matching map $\alpha\mapsto p_{\mathrm{eff}}(\alpha)$ (Algorithm \ref{['alg:alpha-to-p']}), obtained by fitting the $k$‑sparse scaling of $Q_\alpha(\beta^{(k)})$ against the exact $k^{\,1-p/2}$ scaling of $\|\beta^{(k)}\|_p^p$. Target points ($p\!\in\!\{1.1,1.5,1.9\}$) are annotated; their corresponding $\alpha$ are solved by Algorithm \ref{['alg:bisection']}.
  • ...and 13 more figures

Theorems & Definitions (24)

  • Theorem 3.1: $\ell_r$ scaling of minimum-$\ell_p$ interpolators
  • Remark 3.2: Dual viewpoint
  • proof : Proof sketch
  • Corollary 4.1: Single spike
  • Corollary 4.2: Flat support
  • Remark A.1: Standing assumptions and probability shorthand
  • Theorem A.2: Theorem \ref{['thm:main_compact']} restated
  • Remark A.3: When the third term is absorbed
  • Remark A.4: Boundary $p=2$
  • Lemma A.5: norm of $Y$
  • ...and 14 more