SliceFine: The Universal Winning-Slice Hypothesis for Pretrained Networks

Md Kowsher, Ali O. Polat, Ehsan Mohammady Ardehaly, Mehrdad Salehi, Zia Ghiasi, Prasanth Murali, Chen Chen

TL;DR

The paper addresses why tiny, randomly selected slices within pretrained networks can suffice for downstream adaptation. It introduces the Universal Winning Slice Hypothesis (UWSH), grounded in spectral balance across slice groups and high task energy in frozen backbones, to show that any sufficiently wide slice can be a local winning ticket and a small set of slices can form a global winning ticket. Building on this theory, SliceFine updates only moving slices across layers with zero additional parameters, achieving competitive accuracy against strong PEFT baselines while improving training speed, memory efficiency, and model compactness. Empirical results span language, vision, and video tasks, with ablations mapping how slice rank, switching intervals, and backbone quality influence performance, thereby offering a theoretically grounded, practical alternative to adapter- and prune-based approaches. The work bridges theory and practice, suggesting a universal slice-based pathway to parameter-efficient fine-tuning in large-scale pretrained models.

Abstract

This paper presents a theoretical framework explaining why fine-tuning small, randomly selected subnetworks (slices) within pre-trained models can be sufficient for downstream adaptation. We prove that pretrained networks exhibit a universal winning-slice property arising from two phenomena: (1) spectral balance: the eigenspectra of different weight-matrix slices are remarkably similar; and (2) high task energy: their backbone representations retain rich, task-relevant features. This leads to the Universal Winning Slice Hypothesis, which provides a theoretical foundation for parameter-efficient fine-tuning (PEFT) in large-scale models. Inspired by this, we propose SliceFine, a PEFT method that exploits this inherent redundancy by updating only selected slices of the original weights, introducing zero new parameters, unlike adapter-based approaches. Empirically, SliceFine matches the performance of state-of-the-art PEFT methods across language and vision tasks, while significantly improving training speed, memory efficiency, and model compactness. Our work bridges theory and practice, offering a theoretically grounded alternative to existing PEFT techniques.

Paper Structure

This paper contains 28 sections, 7 theorems, 42 equations, 12 figures, 9 tables, 1 algorithm.

Key Result

Lemma 2.1

Consider a pretrained layer $W^{(\ell)}$ partitioned into $k$ disjoint groups $\{W_g\}_{g=1}^k$. Let $\Sigma_g := W_g W_g^\top$ and let $\lambda_1(\Sigma_g) \ge \dots \ge \lambda_{d_\ell/k}(\Sigma_g)$ denote its eigenvalues in descending order. Although each group is anisotropic—its eigenvalues deca
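
The quantities named in the lemma statement, $\Sigma_g = W_g W_g^\top$ and its sorted eigenvalues, can be computed with a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `group_spectra` is hypothetical, the groups are assumed to be contiguous, equally sized row blocks of $W^{(\ell)}$, and a random Gaussian matrix stands in for a pretrained weight. To actually probe spectral balance, `W` would have to be replaced by a weight matrix loaded from a pretrained model.

```python
import numpy as np

def group_spectra(W, k):
    """Eigenvalues of Sigma_g = W_g W_g^T for k row groups of W, each sorted descending.

    Assumption: groups are contiguous, equally sized row blocks.
    """
    d_out = W.shape[0]
    assert d_out % k == 0, "this sketch assumes d_l is divisible by k"
    spectra = []
    for W_g in np.split(W, k, axis=0):           # each W_g: (d_out / k, d_in)
        sigma_g = W_g @ W_g.T                    # symmetric PSD, (d_out / k, d_out / k)
        eig = np.linalg.eigvalsh(sigma_g)[::-1]  # eigvalsh returns ascending order
        spectra.append(eig)
    return np.stack(spectra)                     # (k, d_out / k)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))               # random stand-in, NOT a pretrained weight
spectra = group_spectra(W, k=4)

# Relative spread of the i-th eigenvalue across groups; small values mean the
# groups have similar spectra.
spread = spectra.std(axis=0) / spectra.mean(axis=0)
```

Plotting the rows of `spectra` on a shared axis gives a picture of the kind described for Figure 2, with one curve per group.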

Figures (12)

  • Figure 1: (Left) Winning Tickets. In a pretrained network, a randomly chosen slice of a layer $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ acts as a local winning ticket: tuning only that slice lowers the loss while keeping the backbone frozen. A few such slices (row, column, or row-column) selected across layers constitute a global winning ticket. (Right) SliceFine. At step $t$, only a slice of the weight matrix $W^{(\ell)}$ is updated; all other entries remain fixed. Every $N$ steps, we activate a new slice at a different position for learning; the previously active slice retains its learned update but is frozen. Top: column sweep, where the slice slides across columns. Bottom: row-column alternation, where the slice alternates between a column block and a row block to cover complementary directions. A row sweep works analogously, with the slice sliding across rows. This schedule updates only a tiny portion of the model at a time while gradually covering many regions; applying it across several layers yields a global winner.
  • Figure 2: Eigenvalue spectra of FFN, Key, Query, and Value weight matrices from different layers of a pretrained RoBERTa-base model. For each matrix, weights are partitioned into groups, and the eigenvalues of the within-group covariance $\Sigma_g=W_g^{(\ell)} W_g^{(\ell)\top}$ are plotted in descending order.
  • Figure 3: Empirical evidence for the robustness of slice selection strategies across tasks. (a) Rank vs. Accuracy: Increasing the slice rank improves accuracy up to a point, after which validation accuracy declines, indicating gradual overfitting. (b) Position vs. Accuracy: accuracy remains stable across slice positions, within $\pm 1\%$ of the anchor accuracy. (c) Wanda category ablations: accuracy is insensitive to whether slices are chosen from most important, less important, mixed, or random weights. (d) LTH comparison: even "bad" slices perform comparably to the "best" slices, supporting the winner-slice property—pretrained networks contain many capable subnetworks.
  • Figure 4: Comparison of PEFT methods on (a) PEFT model size, (b) peak memory, (c) throughput, and (d) total training time across ViT, VideoMAE, RoBERTa, and multiple datasets.
  • Figure 5: PCA/NTK agreement across layers. Each row shows (i) the centered representation kernel $K=\tilde{H}\tilde{H}^\top$, (ii) the reconstruction $PP^\top$ with $P=\tilde{H} V$, (iii) the absolute difference $\lvert K-PP^\top\rvert$, and (iv) cumulative explained variance from PCA on $\Sigma$ versus eigenvalues from $K$. Results are shown for RoBERTa-base layers 1, 5, and 11. Spectra and CEV curves overlap, confirming Lemma 2.5 across depth.
  • ...and 7 more figures
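
The column-sweep schedule described in the Figure 1 caption (update one slice, switch to a new position every $N$ steps, keep earlier updates but freeze them) can be illustrated on a toy problem. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it uses a single linear layer with a squared-error loss and plain gradient descent, and the names `column_slice`, `slice_sweep_gd`, `rank`, and `switch_every` (standing in for $N$) are hypothetical.

```python
import numpy as np

def column_slice(step, n_cols, rank, switch_every):
    """Column range [lo, hi) active at `step` under a column sweep that wraps around."""
    lo = ((step // switch_every) * rank) % n_cols
    return lo, min(lo + rank, n_cols)

def loss(W, X, Y):
    """Mean over examples of 0.5 * ||W x - y||^2."""
    return 0.5 * np.mean(np.sum((X @ W.T - Y) ** 2, axis=1))

def slice_sweep_gd(W, X, Y, rank=2, switch_every=5, steps=20, lr=0.05):
    """Gradient descent that only updates the currently active column slice of W."""
    W = W.copy()
    n = X.shape[0]
    for t in range(steps):
        lo, hi = column_slice(t, W.shape[1], rank, switch_every)
        residual = X @ W.T - Y                     # (n, d_out)
        grad_slice = residual.T @ X[:, lo:hi] / n  # gradient w.r.t. W[:, lo:hi] only
        W[:, lo:hi] -= lr * grad_slice             # all other entries stay fixed
    return W

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 8))                   # toy inputs
Y = rng.standard_normal((32, 4))                   # toy targets
W0 = 0.1 * rng.standard_normal((4, 8))             # stand-in for a "pretrained" layer

W_one = slice_sweep_gd(W0, X, Y, steps=5)          # one slice position only
W_full = slice_sweep_gd(W0, X, Y, steps=20)        # sweep visits every column block once
changed_one = np.any(W_one != W0, axis=0)          # which columns moved
changed_full = np.any(W_full != W0, axis=0)
```

After the first five steps only the first column block differs from `W0`; after the full sweep every block has received its own update, while no new parameters were introduced. A row sweep or row-column alternation would change only how `column_slice` picks the index range.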

Theorems & Definitions (13)

  • Lemma 2.1: Spectral Balance Across Slices
  • Definition 2.2: Local Winner
  • Definition 2.3: Global Winner
  • Theorem 2.4: Universal Winning Ticket
  • Lemma 2.5: PCA Decomposition of the Representation/Linearized NTK Kernel
  • Corollary 2.6: Minimal Slice Rank from PCA/NTK Spectrum
  • Proof of Theorem 2.4
  • Proof of Lemma 2.5
  • Remark B.1: Relation to the NTK
  • Lemma D.1: Backbone energy & alignment condition for local winners
  • ...and 3 more
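
The relationship behind Lemma 2.5 and the Figure 5 panels, namely that the kernel $K=\tilde{H}\tilde{H}^\top$ is reproduced by $PP^\top$ with $P=\tilde{H}V$ and that PCA eigenvalues match the kernel's nonzero eigenvalues, can be checked numerically. The sketch below rests on assumptions not stated on this page: $\Sigma$ is taken to be the unnormalized feature covariance $\tilde{H}^\top\tilde{H}$, $V$ holds all of its eigenvectors, and random data stands in for model representations.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((20, 6))               # stand-in: 20 examples, feature width 6
H_tilde = H - H.mean(axis=0, keepdims=True)    # center each feature across examples

K = H_tilde @ H_tilde.T                        # (n, n) centered representation kernel
Sigma = H_tilde.T @ H_tilde                    # (d, d); ASSUMED definition of Sigma

eigvals, V = np.linalg.eigh(Sigma)             # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]       # reorder to descending

P = H_tilde @ V                                # projections onto principal directions
recon_error = np.abs(K - P @ P.T).max()        # analogue of the |K - PP^T| panel

k_eigs = np.linalg.eigvalsh(K)[::-1][: len(eigvals)]  # top-d eigenvalues of K
cev = np.cumsum(eigvals) / eigvals.sum()       # cumulative explained variance
```

Because `V` is orthogonal, `P @ P.T` equals `K` up to floating-point error, and `k_eigs` coincides with `eigvals`. Truncating `V` to its leading columns would turn the reconstruction into a low-rank approximation whose quality is summarized by `cev`.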