Memorization Capacity for Additive Fine-Tuning with Small ReLU Networks
Jy-yong Sohn, Dohyun Kwon, Seoyeon An, Kangwook Lee
TL;DR
The paper introduces Fine-Tuning Capacity (FTC) to quantify the memorization limits of additive fine-tuning on pre-trained networks. It analyzes two concrete architectures, 2-layer and 3-layer ReLU networks used as side networks $g$ added to a frozen pre-trained network $f$, and derives tight bounds on the neuron count $m$ required to arbitrarily change $N$ labels in a dataset of size $K$. Specifically, 2-layer ReLU fine-tuning requires $m = \Theta(N)$ neurons, while 3-layer ReLU fine-tuning reduces this to $m = \Theta(\sqrt{N})$; in both cases the bound holds no matter how large $K$ is. The results connect memorization capacity to fine-tuning through constructive upper-bound networks, matching lower bounds, and extensions to deeper architectures, complemented by synthetic experiments that validate the square-root scaling. Setting $N = K$ recovers the known memorization capacity results. Overall, FTC offers a theoretical framework for understanding the resource efficiency of fine-tuning large pre-trained models.
Abstract
Fine-tuning large pre-trained models is a common practice in machine learning applications, yet its mathematical analysis remains largely unexplored. In this paper, we study fine-tuning through the lens of memorization capacity. Our new measure, the Fine-Tuning Capacity (FTC), is defined as the maximum number of samples a neural network can fine-tune, or equivalently, as the minimum number of neurons ($m$) needed to arbitrarily change $N$ labels among $K$ samples considered in the fine-tuning process. In essence, FTC extends the memorization capacity concept to the fine-tuning scenario. We analyze FTC for the additive fine-tuning scenario where the fine-tuned network is defined as the summation of the frozen pre-trained network $f$ and a neural network $g$ (with $m$ neurons) designed for fine-tuning. When $g$ is a ReLU network with either 2 or 3 layers, we obtain tight upper and lower bounds on FTC; we show that $N$ samples can be fine-tuned with $m=\Theta(N)$ neurons for 2-layer networks, and with $m=\Theta(\sqrt{N})$ neurons for 3-layer networks, no matter how large $K$ is. Our results recover the known memorization capacity results when $N = K$ as a special case.
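To make the additive setting concrete, the following is a minimal illustrative sketch, not the paper's construction. It assumes scalar inputs with distinct sample points, a stand-in frozen network `f` (here a sine function, chosen only for the demo), and a hypothetical helper `build_side_net` that places one narrow "hat" made of three ReLU neurons at each sample whose label should change. The resulting 2-layer side network $g$ uses $3N$ neurons, which is linear in $N$ and does not grow with $K$, in the spirit of the $m = \Theta(N)$ statement above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def build_side_net(x_all, idx_change, deltas):
    """Build a 2-layer ReLU net g(x) = sum_j a_j * relu(w_j * x + b_j).

    Hypothetical helper for illustration: one 3-neuron hat per changed sample.
    Assumes 1D inputs with pairwise distinct values.
    """
    gap = np.min(np.diff(np.sort(x_all)))  # smallest distance between samples
    h = gap / 2.0                          # hat half-width; never reaches a neighbor
    w, b, a = [], [], []
    for i, d in zip(idx_change, deltas):
        c = x_all[i]
        # hat(x) = [relu(x-(c-h)) - 2*relu(x-c) + relu(x-(c+h))] / h
        # equals 1 at x = c and 0 outside (c-h, c+h).
        for knot, coef in ((c - h, 1.0), (c, -2.0), (c + h, 1.0)):
            w.append(1.0)
            b.append(-knot)
            a.append(d * coef / h)
    return np.array(w), np.array(b), np.array(a)

def g(x, params):
    w, b, a = params
    hidden = relu(np.outer(x, w) + b)  # shape (len(x), m)
    return hidden @ a

def f(x):
    # Stand-in for the frozen pre-trained network (an assumption for this demo).
    return np.sin(2 * np.pi * x)

K, N = 200, 5
x_all = np.linspace(0.0, 1.0, K)
rng = np.random.default_rng(0)
idx_change = rng.choice(K, size=N, replace=False)  # samples whose labels change
new_labels = rng.normal(size=N)                    # arbitrary target labels
deltas = new_labels - f(x_all[idx_change])         # correction g must supply

params = build_side_net(x_all, idx_change, deltas)
tuned = f(x_all) + g(x_all, params)                # additive fine-tuning: f + g
print("neurons used:", params[0].size)
```

In this sketch, `tuned` matches `new_labels` on the $N$ selected samples and agrees with `f` on the remaining $K - N$ samples up to floating-point error. Increasing `K` only narrows the hats; the neuron count stays at $3N$.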
