Why Fine-grained Labels in Pretraining Benefit Generalization?

Guan Zhe Hong; Yin Cui; Ariel Fuxman; Stanley Chan; Enming Luo

TL;DR

This work investigates why fine-grained pretraining labels improve generalization. It introduces a hierarchical multi-view data model and analyzes the SGD dynamics of a two-layer ReLU CNN. The analysis shows that coarse-grained pretraining learns only the common features, yielding low error on easy samples but substantial error on hard ones, whereas fine-grained pretraining also learns the rare features, driving the error on hard samples down to $o(1)$ while keeping it low on easy ones. Experiments on ImageNet21k and iNaturalist 2021 support the theory: fine-grained pretraining helps when the pretraining label space is well aligned with the downstream task and each class has enough samples, but it can fail under extreme granularity or misalignment. Overall, the paper identifies a formal mechanism, representation-label correspondence, for how label granularity shapes feature learning and downstream generalization, and offers practical guidance on when fine-grained pretraining is advantageous.

Abstract

Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data for downstream tasks, often yields better generalization than pretraining with coarse-labeled data. While there is ample empirical evidence supporting this, the theoretical justification remains an open problem. This paper addresses this gap by introducing a "hierarchical multi-view" structure to confine the input data distribution. Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.
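The distinction between common and rare features can be made concrete with a toy sketch. The code below is an illustration under loud assumptions, not the paper's construction: the orthogonal feature dictionary, the class counts, the noise level, and the names `sample` and `coarse_error` are all hypothetical, and the feature detectors are set by hand rather than learned by SGD. It only shows why a model holding just common-feature detectors would be expected to fail on hard samples (where the common feature is absent), while a model that also holds fine-grained detectors can still recover the coarse label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration only).
N_COARSE, N_FINE_PER_COARSE = 2, 3
N_FINE = N_COARSE * N_FINE_PER_COARSE
DIM = N_COARSE + N_FINE

# One orthogonal direction per feature: common features first, then fine ones.
basis = np.eye(DIM)
common_feats = basis[:N_COARSE]   # one common feature per superclass
fine_feats = basis[N_COARSE:]     # one rare feature per subclass


def sample(n, hard, noise=0.05):
    """Draw n inputs; hard samples lack the common (superclass) feature."""
    fine_y = rng.integers(0, N_FINE, size=n)
    coarse_y = fine_y // N_FINE_PER_COARSE
    x = fine_feats[fine_y].copy()
    if not hard:
        x += common_feats[coarse_y]
    x += noise * rng.standard_normal(x.shape)
    return x, coarse_y


def coarse_error(x, y, use_fine_detectors):
    """Coarse-label error of a hand-set detector bank with ReLU responses."""
    scores = np.maximum(x @ common_feats.T, 0.0)
    if use_fine_detectors:
        fine_resp = np.maximum(x @ fine_feats.T, 0.0)
        # Pool each subclass detector into the score of its superclass.
        scores = scores + fine_resp.reshape(
            len(x), N_COARSE, N_FINE_PER_COARSE
        ).sum(axis=2)
    return float(np.mean(scores.argmax(axis=1) != y))


results = {}
for name, hard in [("easy", False), ("hard", True)]:
    x, y = sample(2000, hard)
    results[name] = (coarse_error(x, y, False), coarse_error(x, y, True))
    print(f"{name}: common-only={results[name][0]:.3f}, "
          f"common+fine={results[name][1]:.3f}")
```

In this toy setting the common-only detector bank is close to chance on hard samples, since it sees only noise there, while both banks handle easy samples. The paper's actual result concerns which of these detectors SGD ends up learning under coarse versus fine-grained labels, which this sketch does not simulate.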
