Why Fine-grained Labels in Pretraining Benefit Generalization?

Guan Zhe Hong, Yin Cui, Ariel Fuxman, Stanley Chan, Enming Luo

TL;DR

This work investigates why fine-grained pretraining labels improve generalization. It introduces a hierarchical multi-view data model and analyzes the SGD dynamics of a two-layer ReLU CNN. The analysis shows that coarse-grained pretraining learns only the common features, yielding low error on easy samples but substantial error on hard ones, whereas fine-grained pretraining also learns the rare features, reducing the error on hard samples to $o(1)$ as well. The theory is supported by experiments on ImageNet21k and iNaturalist 2021: fine-grained pretraining helps when the label space is well aligned with the downstream task and each class has enough samples, but it can fail under extreme granularity or misalignment. Overall, the paper offers a formal mechanism, representation-label correspondence, for how label granularity shapes feature learning and downstream generalization, along with practical guidance on when fine-grained pretraining is advantageous.

Abstract

Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data for downstream tasks, often yields better generalization than pretraining with coarse-labeled data. While there is ample empirical evidence supporting this, the theoretical justification remains an open problem. This paper addresses this gap by introducing a "hierarchical multi-view" structure to confine the input data distribution. Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.
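The two pretraining regimes compared in the abstract differ only in which level of a label hierarchy supervises the network. The sketch below illustrates that relationship on a toy cat-versus-dog example. The breed names, the `FINE_TO_COARSE` mapping, and the `coarsen` helper are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
from collections import Counter

# Hypothetical two-level hierarchy: fine-grained label -> coarse label.
# The breed names are illustrative assumptions, not the paper's data.
FINE_TO_COARSE = {
    "tabby": "cat",
    "siamese": "cat",
    "beagle": "dog",
    "poodle": "dog",
}


def coarsen(fine_labels, mapping=FINE_TO_COARSE):
    """Map each fine-grained label to its coarse parent label."""
    return [mapping[label] for label in fine_labels]


# Toy source dataset annotated at the fine-grained level.
fine_labels = ["tabby", "beagle", "siamese", "poodle", "tabby"]
coarse_labels = coarsen(fine_labels)

# Fine-grained pretraining would see 4 classes here, while coarse-grained
# pretraining and the downstream task see only 2.
print(Counter(fine_labels))
print(Counter(coarse_labels))
```

In this toy setup the inputs are the same in both regimes; only the label granularity presented during pretraining changes, which is the variable the paper's analysis isolates.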

Paper Structure

This paper contains 46 sections, 28 theorems, 213 equations, 6 figures, 8 tables.

Key Result

Theorem 5.1

(Summary). Let the number of subclasses be lower-bounded: $k_y \ge \text{polylog}(d)$. With high probability, with proper choice of step size, there exists a time $T^* \in \text{poly}(d)$ such that for any $T \in [T^*, \text{poly}(d)]$, the training loss is upper bounded according to Moreover, for an easy test sample $(\boldsymbol{X}_{\text{easy}},y)$, the probability of making a classification m

Figures (6)

  • Figure 1: The goal of this paper is to provide a theoretical justification of why fine-grained labels in pre-training benefit generalization.
  • Figure 2: ImageNet21k$\to$ImageNet1k transfer using a ViT-B/16 model. [Blue]: pretrained on the WordNet hierarchy of ImageNet21k, finetuned on ImageNet1k. [Red]: baseline, trained and evaluated on ImageNet1k.
  • Figure 3: A simplified symbolic representation of the cat versus dog problem.
  • Figure 4: Illustration of features and patches.
  • Figure 5: In-dataset transfer. ResNet34 validation error (with standard deviation) of finetuning on 11 superclasses of iNaturalist 2021, pretrained on various label hierarchies. The manual hierarchy outperforms the baseline and every other hierarchy, and exhibits a U-shaped curve.
  • ...and 1 more figure

Theorems & Definitions (63)

  • Definition 4.1: Features
  • Definition 4.2: Input patches
  • Definition 4.3: Source dataset's label mapping
  • Definition 4.4: Source training set
  • Theorem 5.1: Coarse-label training: baseline
  • Theorem 5.2: Fine-grained-label training
  • Remark
  • Remark
  • Remark
  • Definition C.1
  • ...and 53 more