
The Rules-and-Facts Model for Simultaneous Generalization and Memorization in Neural Networks

Gabriele Farné, Fabrizio Boncoraglio, Lenka Zdeborová

Abstract

A key capability of modern neural networks is their capacity to simultaneously learn underlying rules and memorize specific facts or exceptions. Yet, theoretical understanding of this dual capability remains limited. We introduce the Rules-and-Facts (RAF) model, a minimal solvable setting that enables precise characterization of this phenomenon by bridging two classical lines of work in the statistical physics of learning: the teacher-student framework for generalization and Gardner-style capacity analysis for memorization. In the RAF model, a fraction $1 - \varepsilon$ of training labels is generated by a structured teacher rule, while a fraction $\varepsilon$ consists of unstructured facts with random labels. We characterize when the learner can simultaneously recover the underlying rule - allowing generalization to new data - and memorize the unstructured examples. Our results quantify how overparameterization enables the simultaneous realization of these two objectives: sufficient excess capacity supports memorization, while regularization and the choice of kernel or nonlinearity control the allocation of capacity between rule learning and memorization. The RAF model provides a theoretical foundation for understanding how modern neural networks can infer structure while storing rare or non-compressible information.
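
To make the data model concrete, the following is a minimal sketch (in Python/NumPy) of how a RAF training set could be generated. The Gaussian inputs and linear sign teacher are illustrative assumptions for this sketch, not necessarily the exact specification used in the paper; the essential ingredients are only that a fraction $1-\varepsilon$ of labels follows a structured rule while a fraction $\varepsilon$ carries random, non-compressible labels.

    import numpy as np

    def sample_raf_dataset(n, d, eps, seed=None):
        """Toy Rules-and-Facts (RAF) training set.

        A fraction (1 - eps) of the labels follows a structured teacher rule
        (here, illustratively, the sign of a linear teacher acting on Gaussian
        inputs); the remaining fraction eps are "facts": examples whose labels
        are drawn at random and can only be memorized, not inferred from the rule.
        """
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs (assumed Gaussian)
        w_teacher = rng.standard_normal(d)             # teacher rule (assumed linear)
        y = np.sign(X @ w_teacher)                     # rule-generated labels
        is_fact = rng.random(n) < eps                  # mark roughly a fraction eps as facts
        y[is_fact] = rng.choice([-1.0, 1.0], size=int(is_fact.sum()))  # random labels for the facts
        return X, y, is_fact, w_teacher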



Figures (15)

  • Figure 1: Generalization--memorization trade-off induced by regularization for the RAF model at fraction of facts $\varepsilon=0.1$ and sample complexity $\alpha=2.0/(1-\varepsilon)$. Left: KRR, square loss. Right: SVM, hinge loss. Each plot shows the parametric curve $\lambda \mapsto (\mathcal{E}_{\rm gen}(\lambda),\mathcal{E}_{\rm mem}(\lambda))$ for different models: a linear perceptron and kernel regression with the kernels corresponding to erf and ReLU in Eq. \ref{eq:kernels-example}, together with the Bayes-optimal generalization baseline $\mathcal{E}_{\rm gen}^{\rm BO}=0.2008$ (bold black dashed vertical line). For the square loss, the optimal generalization error is the same, $\mathcal{E}_{\rm gen}^{\rm opt,square}=0.2084$, for all three depicted models (thin gray dotted vertical line). Endpoints correspond to $\lambda\to 0^+$ (triangle) and $\lambda\to+\infty$ (square). For the hinge loss, the minimum test errors are $\mathcal{E}_{\rm gen}^{\rm opt,hinge}=0.2094$ (perceptron), $\mathcal{E}_{\rm gen}^{\rm opt,hinge}=0.2068$ (erf), and $\mathcal{E}_{\rm gen}^{\rm opt,hinge}=0.2031$ (ReLU). A minimal numerical sketch of such a $\lambda$-sweep is given after this figure list.
  • Figure 2: Finite-width random features vs. kernel limit. Parametric memorization--generalization trade-off curves $\lambda\mapsto\bigl(\mathcal{E}_{\rm gen}(\lambda),\mathcal{E}_{\rm mem}(\lambda)\bigr)$ for increasing model widths $\kappa$, showing convergence to the $\kappa \to \infty$ kernel prediction. Data model parameters: $\varepsilon = 0.1,\, \alpha(1-\varepsilon)=2$. In the random-features case, the endpoints are obtained numerically.
  • Figure 3: Generalization and memorization dependence on the angle $\gamma = \arctan\!\left( \mu_1 / \mu_\star \right)$. KRR (square loss) in the top panels, SVM (hinge loss) in the bottom panels. Left panels: $\mathcal{E}_{\rm gen}(\lambda \to 0^+) - \min_\gamma \mathcal{E}_{\rm gen}^{\rm square}(\lambda_\mathrm{opt})$; center panels: $\mathcal{E}_{\rm gen}(\lambda_{\rm opt}) - \min_\gamma \mathcal{E}_{\rm gen}^{\rm square}(\lambda_\mathrm{opt})$; right panels: $\mathcal{E}_{\rm mem}(\lambda_{\rm opt})$. The fraction of facts in the training set is fixed to $\varepsilon = 0.2$. The sample complexities are $\alpha \in \{2, 4, 10, 20\}$ in each panel. The dashed black vertical line in the upper panels marks the optimal angle $\gamma^{\rm opt}_{\rm mem}(\varepsilon)$ of Eq. \ref{eq:angle_opt_mem}, at which, for the square loss, minimum generalization error and perfect memorization are reached simultaneously. For the hinge loss (bottom panels), the angle where $\mathcal{E}_\mathrm{gen}^\mathrm{hinge}(\lambda_\mathrm{opt})$ is minimal is marked by a cross, while the angle where $\mathcal{E}_\mathrm{gen}^\mathrm{hinge}(\lambda\to0^+)$ is minimal is marked by a dot. The table summarizes the Bayes-optimal error, the minimum generalization error for the square loss (which coincides for $\lambda \to 0^+$ and $\lambda_{\rm opt}$), and the minimum generalization error for the hinge loss -- both for $\lambda \to 0^+$ and $\lambda_{\rm opt}$.
  • Figure 4: Memorization--generalization trade-off curves $\lambda \mapsto (\mathcal{E}_{\mathrm{gen}}(\lambda),\mathcal{E}_{\mathrm{mem}}(\lambda))$ for kernel methods in the RAF model at $\varepsilon=0.1$ and $\alpha(1-\varepsilon)=2.0$. The left panel reports the square loss, the right panel the hinge loss. Curves are shown for representative kernel geometries parameterized by the angle $\gamma$ defined in Eq. \ref{eq:angle}. For the square loss (left), we display the curve for the optimal angle $\gamma_{\rm mem}^{\rm opt}= 0.8011$ from Eq. \ref{eq:angle_opt_mem}. For the hinge loss (right), we display the angle achieving optimal generalization, $\gamma_{\rm gen}^{\rm opt,hinge}=0.9774$. For both losses, we also display one representative lower and one higher value of the angle. Triangles mark the limit $\lambda\to0^{+}$, squares mark the limit $\lambda\to+\infty$, the vertical dashed line indicates the Bayes-optimal generalization baseline, and the vertical dotted line indicates the optimal generalization error for the square loss.
  • Figure 5: Qualitative comparison between theory for the RAF model (upper left panel) and experiments on real data, namely the CIFAR10-RAF task (lower and right panels). In all panels, we use RBF kernel ridge regression with $\alpha = 4.0$ and $\varepsilon = 0.2$. The upper panels show the parametric memorization--generalization curves $\lambda \mapsto (\mathcal{E}_{\mathrm{gen}}(\lambda), \mathcal{E}_{\mathrm{mem}}(\lambda))$. The kernel bandwidth $\eta$ for CIFAR10-RAF is selected based on the lower panels: generalization error as a function of $\eta$ at optimal regularization (left), with a minimum around $\eta \approx 4$ (red dashed line), and at small regularization (right), with a minimum around $\eta \approx 6.75$ (purple dashed line). The parametric curves are then shown for these two values of $\eta$, together with one larger and one smaller representative value. The RAF model exhibits qualitatively similar behavior as a function of $\eta$, with the corresponding values shifted.
  • ...and 10 more figures
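
To make the parametric trade-off curves of Figures 1, 2, and 4 concrete, here is a minimal numerical sketch that sweeps the ridge regularization $\lambda$ for kernel ridge regression on a RAF dataset (as generated by sample_raf_dataset above) and records the pair $(\mathcal{E}_{\rm gen}(\lambda), \mathcal{E}_{\rm mem}(\lambda))$. The RBF kernel, its bandwidth, and the error definitions used here (sign mismatch on fresh rule-labeled data for $\mathcal{E}_{\rm gen}$, sign mismatch on the stored facts for $\mathcal{E}_{\rm mem}$) are assumptions chosen for illustration, not the paper's exact protocol.

    import numpy as np

    def rbf_kernel(A, B, bandwidth=1.0):
        # Gaussian RBF kernel matrix between the rows of A and the rows of B (illustrative choice).
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * bandwidth ** 2))

    def tradeoff_curve(X, y, is_fact, w_teacher, lambdas, n_test=2000, bandwidth=1.0, seed=0):
        # Sweep the ridge parameter and record (E_gen, E_mem) for kernel ridge regression.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Fresh test set labeled purely by the teacher rule: the facts cannot be generalized.
        X_test = rng.standard_normal((n_test, d)) / np.sqrt(d)
        y_test = np.sign(X_test @ w_teacher)
        K = rbf_kernel(X, X, bandwidth)
        K_test = rbf_kernel(X_test, X, bandwidth)
        curve = []
        for lam in lambdas:
            coef = np.linalg.solve(K + lam * np.eye(n), y)              # KRR dual coefficients
            e_gen = np.mean(np.sign(K_test @ coef) != y_test)           # rule error on new data
            e_mem = np.mean(np.sign(K[is_fact] @ coef) != y[is_fact])   # error on the stored facts
            curve.append((e_gen, e_mem))
        return np.array(curve)

    # Example usage (hypothetical sizes; eps = 0.1 as in Figure 1):
    # X, y, is_fact, w = sample_raf_dataset(n=500, d=200, eps=0.1)
    # curve = tradeoff_curve(X, y, is_fact, w, lambdas=np.logspace(-4, 2, 25))

Plotting the rows of curve against each other traces out the analogue of the parametric curves shown in the figures: small $\lambda$ favors memorization of the facts, while larger $\lambda$ shifts capacity toward the rule, at the price of forgetting the facts.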