Is Oracle Pruning the True Oracle?

Sicheng Feng, Keda Tao, Huan Wang

TL;DR

This work questions the 35-year reliance on oracle pruning by empirically linking pruned-train loss to post-retraining performance across a wide spectrum of models, from LeNet derivatives to large multimodal models. It introduces a Kendall $\tau$-based correlation framework plus anomaly and counterexample metrics to assess validity, and applies it to 37K trained models spanning MNIST to TinyLLaVA-3.1B. Across modern networks and datasets (CIFAR, ImageNet, ViTs, MLLMs), the pruned-train loss shows weak or negative predictive power for final performance after retraining, signaling that oracle pruning is not a reliable foundation today. The authors argue that rising task complexity and the retraining process must be accounted for when designing pruning criteria, and they demonstrate that even simple, non-oracle baselines may outperform oracle-driven methods, suggesting a retraining-aware paradigm for pruning research.

Abstract

Oracle pruning, which selects unimportant weights by minimizing the pruned train loss, has been taken as the foundation for most neural network pruning methods for over 35 years, yet few, if any, have examined how well this foundation really holds. This paper, for the first time, attempts to examine its validity on modern deep models through empirical correlation analyses and to provide reflections on the field of neural network pruning. Specifically, for a typical pruning algorithm with three stages (pretraining, pruning, and retraining), we analyze the correlation between model performance before and after retraining. Extensive experiments (37K trained models) are conducted across a wide spectrum of models (LeNet5, VGG, ResNets, ViT, MLLM) and datasets (MNIST and its variants, CIFAR10/CIFAR100, ImageNet-1K, MLLM data). The results lead to a surprising conclusion: on modern deep learning models, the performance before retraining is barely correlated with the performance after retraining. In other words, the weights selected by oracle pruning can hardly guarantee good performance after retraining. This further implies that existing works using oracle pruning to derive pruning criteria may be groundless from the beginning. Further studies suggest that rising task complexity is one factor that makes oracle pruning invalid nowadays. Finally, given the evidence, we argue that the retraining stage in a pruning algorithm should be accounted for when developing any pruning criterion.
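The correlation analysis described above can be illustrated with a minimal sketch. It computes a Kendall τ between the pruned train loss (before retraining) and the final test accuracy (after retraining) over a set of candidate pruning masks. The helper name `kendall_tau`, the choice of the τ-a variant (ties contribute to neither count), and the numbers in the usage example are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Hypothetical helper for illustration; the exact tau variant used in
    the paper is not stated in this summary.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")

    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie and is counted in neither bucket

    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up values: one entry per candidate pruning mask.
pruned_train_loss = [0.2, 0.5, 0.9, 1.4]      # measured right after pruning
final_test_acc = [0.99, 0.98, 0.97, 0.95]     # measured after retraining

tau = kendall_tau(pruned_train_loss, final_test_acc)
print(f"Kendall tau = {tau:+.2f}")
```

If oracle pruning held, a lower pruned train loss would go with a higher final test accuracy, so τ against accuracy would be strongly negative (and strongly positive against final test loss). A τ near zero indicates that the pruned train loss says little about the outcome after retraining.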

Paper Structure

This paper contains 17 sections, 4 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Analysis framework of this work. We study the validity of oracle pruning in this paper, by examining the correlation between the pruned train loss and the final test performance (test accuracy or test loss). We apply this analysis framework to a wide range of networks and datasets (from toy networks like LeNet5-Mini to very large ones like ViT-B/16 and TinyLLaVA-3.1B) in order to have a comprehensive evaluation. The key finding of this work is that on modern networks and datasets (starting from the CIFAR level), oracle pruning is invalid, to our surprise. This new finding may challenge the conventional belief in network pruning over the past 35 years.
  • Figure 2: Pruned train loss vs. final test accuracy on MNIST with LeNet5-Mini. The subcaptions correspond to the pruning rates of each image. The blue star indicates the oracle pruning result (the one with the smallest pruned train loss). The points with final test accuracy higher than the oracle pruning are marked in red (anomaly points), and those lower are marked in green.
  • Figure 3: Pruned train loss vs. final test accuracy with ResNet56 (on CIFAR10), VGG19 (on CIFAR100), and ResNet18 (on ImageNet-1K).
  • Figure 4: Pruned train loss vs. final test accuracy on the variants of MNIST dataset, with LeNet5-Mini network (pruning ratio 0.5, Conv1 layer). FMNIST and KMNIST are two drop-in replacements of MNIST, which are more complex. As seen, the correlation becomes weaker on more challenging datasets. See more discussions in Sec. \ref{sec:why_oracle_pruning_ineffective}.
  • Figure 5: Pruned train loss vs. final test accuracy on MNIST with different variants of LeNet5-Mini (pruning ratio 0.5, Conv1 layer). The original LeNet5-Mini (Base) has 5 layers (D5) and each layer has 10 neurons (W10). Here we change the model width and depth to obtain different variants. As seen, the correlation becomes weaker when pruning more complex networks. See more discussions in Sec. \ref{sec:why_oracle_pruning_ineffective}.
  • ...and 8 more figures
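
The marking convention in the Figure 2 caption can be sketched in a few lines: the oracle pruning result is the candidate with the smallest pruned train loss, and an anomaly point is any candidate whose final test accuracy is higher than the oracle's. The function name `find_anomalies` and the sample values are hypothetical, and this sketch only reproduces the caption's rule, not the paper's full anomaly and counterexample metrics.

```python
def find_anomalies(pruned_train_loss, final_test_acc):
    """Return (oracle_index, anomaly_indices) following the Figure 2 caption.

    Hypothetical helper: the oracle is the candidate with the smallest
    pruned train loss; anomalies are candidates that end up with a
    strictly higher final test accuracy than the oracle.
    """
    if len(pruned_train_loss) != len(final_test_acc):
        raise ValueError("inputs must have the same length")
    if not pruned_train_loss:
        raise ValueError("need at least one candidate")

    oracle = min(range(len(pruned_train_loss)),
                 key=lambda i: pruned_train_loss[i])
    oracle_acc = final_test_acc[oracle]
    anomalies = [i for i, acc in enumerate(final_test_acc) if acc > oracle_acc]
    return oracle, anomalies


# Made-up values: one entry per candidate pruning mask.
losses = [0.3, 0.1, 0.7, 0.5]
accs = [0.95, 0.92, 0.96, 0.90]

oracle, anomalies = find_anomalies(losses, accs)
print(f"oracle candidate: {oracle}, anomaly candidates: {anomalies}")
```

In this toy input, two of the three non-oracle candidates beat the oracle after retraining, which is the kind of pattern the red points in the scatter plots are meant to expose.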

Theorems & Definitions (1)

  • Definition 3.1: Validity of Oracle Pruning