The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets

Yujun Kim, Chaewon Moon, Chulhee Yun

TL;DR

This work characterizes the parameter-count cost of robust memorization in ReLU networks as a function of the robustness ratio $\rho=\mu/\epsilon_D$, providing tighter upper and lower bounds across the full $(0,1)$ interval. It gives a regime-dependent picture: when $\rho$ is small, the cost of robust memorization matches that of classical memorization ($\tilde{\Theta}(\sqrt{N})$), but as $\rho$ grows, the required parameter count increases, up to $\tilde{O}(Nd^2)$ in the largest-robustness regime. The authors develop two tools, separation-preserving dimensionality reduction (a strengthened Johnson-Lindenstrauss lemma) and a grid-lattice mapping approach, to construct compact robust memorization schemes, and they extend the analysis to general $\ell_p$ norms. The results reveal a tight coupling between robustness and network complexity and offer a concrete pathway to design efficient robust memorization schemes, including sublinear-parameter constructions in certain $\rho$ regimes. Overall, the paper advances fundamental understanding of robustness costs in neural memorization and closes substantial gaps in the $\rho$-dependent parameter scaling.
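As a rough illustration of the dimensionality-reduction idea mentioned above, the sketch below applies a plain Gaussian random projection (the classical Johnson-Lindenstrauss construction) and compares the minimum pairwise distance before and after. This is not the paper's strengthened, separation-preserving lemma. The function names `min_pairwise_distance` and `gaussian_projection`, the sizes, and the seeds are all assumptions chosen for the example.

```python
import numpy as np

def min_pairwise_distance(X):
    """Smallest Euclidean distance between any two distinct rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(X), k=1)  # each unordered pair once
    return dists[upper].min()

def gaussian_projection(X, k, seed=0):
    """Project rows of X from d to k dimensions with a scaled Gaussian matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Scaling by 1/sqrt(k) keeps squared norms unchanged in expectation.
    W = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ W

# Hypothetical data: N = 50 points in d = 1000 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1000))

Z = gaussian_projection(X, k=200)
ratio = min_pairwise_distance(Z) / min_pairwise_distance(X)
print(f"min-distance ratio after projection: {ratio:.3f}")
```

A ratio close to 1 means the projection roughly preserved the separation of this particular point set. The classical lemma only guarantees this with high probability over the random matrix.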

Abstract

We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu/\epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
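To make the quantities in the abstract concrete, here is a minimal sketch; it is not the authors' code. It assumes the separation constant is half the minimum Euclidean distance between differently labeled points, so that any $\rho < 1$ keeps the $\mu$-balls of different classes disjoint. The function names `separation` and `robustly_memorizes_on_samples` are hypothetical. The sampling check can only find violations; it cannot certify robustness.

```python
import numpy as np

def separation(X, y):
    """Half the minimum l2 distance between differently labeled points.

    This definition of the separation constant is an assumption of the sketch.
    It expects at least two distinct labels in y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    different = y[:, None] != y[None, :]
    return 0.5 * dists[different].min()

def robustly_memorizes_on_samples(f, X, y, rho, n_samples=200, seed=0):
    """Check f(x) == y_i at x_i and at random points in the mu-ball around x_i."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mu = rho * separation(X, y)  # robustness radius mu = rho * eps
    d = X.shape[1]
    for x_i, y_i in zip(X, y):
        # Uniform samples in the l2 ball: random direction, radius ~ mu * U^(1/d).
        dirs = rng.normal(size=(n_samples, d))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        radii = mu * rng.random(n_samples) ** (1.0 / d)
        points = np.vstack([x_i[None, :], x_i + radii[:, None] * dirs])
        if any(f(p) != y_i for p in points):
            return False
    return True

# Toy 1-D dataset: two points with different labels, two units apart.
X = np.array([[0.0], [2.0]])
y = np.array([1, 2])

def midpoint_rule(p):
    # Decision boundary halfway between the two points.
    return 1 if p[0] < 1.0 else 2

def skewed_rule(p):
    # Boundary too close to the first point.
    return 1 if p[0] < 0.2 else 2

print(separation(X, y))
print(robustly_memorizes_on_samples(midpoint_rule, X, y, rho=0.5))
print(robustly_memorizes_on_samples(skewed_rule, X, y, rho=0.5))
```

Both rules interpolate the two training points, but only the midpoint rule stays constant on the balls of radius $\mu = 0.5$. This gap is what separates robust memorization from plain memorization.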

Paper Structure

This paper contains 67 sections, 47 theorems, 346 equations, 8 figures, 1 table.

Key Result

Theorem 3.1

Let $\rho \in (0,1)$. Suppose for any ${\mathcal{D}} \in {\bm{D}}_{d, N, 2}$, there exists a neural network $f \in {\mathcal{F}}_{d, P}$ that can $\rho$-robustly memorize ${\mathcal{D}}$. Then, the number of parameters $P$ must satisfy

Figures (8)

  • Figure 1: Summary of parameter bounds on a log-log scale when $d=\Theta(\sqrt{N})$. We omit constant factors in both axes. Solid blue and red curves show the sufficient (\ref{thm:ub}) and necessary (\ref{thm:lb}) numbers of parameters, respectively; the solid black curves are the best prior bounds. Light-blue shading highlights our improvement in the upper bound, and light-red shading highlights our improvement in the lower bound. The cross-hatched area marks the remaining gap. Notably, this gap disappears in the smallest $\rho$ regime. The yellow and green dashed lines denote the first term (\ref{prop:lb_width_p=2}) and the second term (\ref{prop:lb_vc_p=2}) in \ref{thm:lb}, respectively.
  • Figure 2: In (a), blue balls have label 1; the red ball has label 2. (b) illustrates the distance between ${\mathrm{Null}}({\bm{W}}) \subset {\mathbb{R}}^3$ and the standard basis for ${\bm{W}} = \begin{bmatrix} 1 & 1 & -1 \end{bmatrix}$ with the first hidden layer width 1.
  • Figure 3: Separation-Preserving Projection
  • Figure 4: Grid-based Lattice Mapping.
  • Figure 5: Reduction of Shattering to Robust Memorization. The cross marks refer to the points to be shattered, and the circular dots refer to the points for robust memorization. The centers of robustness balls change with respect to the labels of the points to be shattered.
  • ...and 3 more figures

Theorems & Definitions (89)

  • Definition 2.1
  • Definition 2.2
  • Theorem 3.1
  • Proposition 3.1
  • Proposition 3.1
  • Definition 4.1
  • Theorem 4.2
  • Theorem A.1
  • proof
  • Proposition A.0
  • ...and 79 more