Exploring Weight Balancing on Long-Tailed Recognition Problem

Naoya Hasegawa, Issei Sato

TL;DR

This work investigates weight balancing (WB) for long-tailed recognition (LTR), linking its effectiveness to neural collapse and the cone effect. It decomposes WB into five components: weight decay (WD), MaxNorm, cross-entropy loss (CE), class-balanced loss (CB), and two-stage learning. In stage 1, WD and CE raise Fisher's discriminant ratio of the features and suppress inter-class cosine similarities; in stage 2, WD and CB induce implicit logit adjustment (LA) by reallocating classifier weight norms toward tail classes. The authors prove that, under neural-collapse-like conditions, WB can be simplified to a one-stage approach that uses WD, feature regularization, and an equiangular tight frame (ETF) classifier with multiplicative LA, yielding comparable or superior performance with reduced training complexity. The findings provide a principled guideline for designing LTR training and demonstrate practical simplifications that improve accuracy and efficiency across multiple datasets and model families.
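
To make the CB component more concrete, here is a minimal sketch of per-class loss weights based on the "effective number of samples" heuristic, which is one common form of class-balanced loss. Whether this exact weighting matches the one used in WB is an assumption, and the function name, the value of `beta`, and the class counts are hypothetical.

```python
def class_balanced_weights(counts, beta=0.999):
    """Per-class loss weights from the effective-number heuristic.

    The weight of class k is proportional to (1 - beta) / (1 - beta ** n_k),
    where n_k is the number of training samples in class k. The weights are
    rescaled so that they sum to the number of classes.
    Assumption: this is one common CB weighting, not necessarily the paper's.
    """
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical long-tailed class counts, head class first.
counts = [5000, 500, 50, 5]
weights = class_balanced_weights(counts)

# Tail classes receive larger weights than head classes.
for n, w in zip(counts, weights):
    print(f"n={n:5d}  weight={w:.4f}")
```

In a weighted loss, each sample's cross-entropy term would be multiplied by the weight of its class, so errors on rare classes cost more than errors on frequent ones.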

Abstract

Recognition problems in long-tailed data, in which the sample size per class is heavily skewed, have gained importance because the distribution of the sample size per class in a dataset is generally exponential unless the sample size is intentionally adjusted. Various methods have been devised to address these problems. Recently, weight balancing, which combines well-known classical regularization techniques with two-stage training, has been proposed. Despite its simplicity, it is known for its high performance compared with existing methods devised in various ways. However, there is a lack of understanding as to why this method is effective for long-tailed data. In this study, we analyze weight balancing by focusing on neural collapse and the cone effect at each training stage and find that it can be decomposed into (i) an increase in Fisher's discriminant ratio of the feature extractor, caused by weight decay and cross-entropy loss, and (ii) implicit logit adjustment, caused by weight decay and class-balanced loss. Our analysis enables the training method to be further simplified by reducing the number of training stages to one while increasing accuracy. Code is available at https://github.com/HN410/Exploring-Weight-Balancing-on-Long-Tailed-Recognition-Problem.
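
As a rough illustration of what a higher Fisher's discriminant ratio means, the sketch below computes a simplified scalar variant, tr(S_B) / tr(S_W), where S_B and S_W are the between-class and within-class scatter matrices of the features. This trace-ratio form is an assumption chosen for brevity and may differ from the definition used in the paper; the function name and the toy data are hypothetical.

```python
def fisher_ratio(features, labels):
    """Simplified Fisher's discriminant ratio: tr(S_B) / tr(S_W).

    features: list of equal-length lists of floats.
    labels:   list of class ids, one per feature vector.
    Assumption: trace-ratio variant, not necessarily the paper's definition.
    Raises ZeroDivisionError if every feature equals its class mean.
    """
    dim = len(features[0])
    n = len(features)
    global_mean = [sum(f[d] for f in features) / n for d in range(dim)]

    # Group the feature vectors by class label.
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)

    between = 0.0  # tr(S_B): spread of class means around the global mean
    within = 0.0   # tr(S_W): spread of samples around their own class mean
    for members in by_class.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        between += len(members) * sum(
            (m - g) ** 2 for m, g in zip(mean, global_mean)
        )
        within += sum((x - m) ** 2 for f in members for x, m in zip(f, mean))
    return between / within


labels = [0, 0, 1, 1]
tight = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]  # compact classes
loose = [[0.0, 0.0], [0.6, 0.0], [1.0, 1.0], [0.4, 1.0]]  # spread-out classes

ratio_tight = fisher_ratio(tight, labels)
ratio_loose = fisher_ratio(loose, labels)
print(ratio_tight > ratio_loose)
```

Features whose classes are compact and far apart score higher, which is the sense in which a feature extractor with a larger ratio is easier to separate linearly.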

Paper Structure (41 sections, 5 theorems, 25 equations, 6 figures, 16 tables)

This paper contains 41 sections, 5 theorems, 25 equations, 6 figures, 16 tables.

Key Result

Theorem 1

For all $(\mathbf{x}_i, y_i), (\mathbf{x}_j, y_j) \in \mathcal{D}$ s.t. $y_i \neq y_j$, if $\mathbf{W}$ is an ETF and there exist $\epsilon$ and $L$ s.t. $\left\|\frac{\partial \ell_{\mathrm{CE}}}{\partial \bm{g}(\mathbf{x}_{i})}\right\|_2, \ldots$ [...] where $\delta \equiv \frac{1}{L}\frac{C-1}{C}\log\left(\frac{(C-1)(1-\epsilon)}{\epsilon}\right)$ [...]
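
The theorem assumes that the classifier $\mathbf{W}$ is an ETF. As a hedged illustration of that assumption only, the sketch below builds a simplex ETF for $C$ classes using the standard construction $\sqrt{C/(C-1)}\,(\mathbf{I} - \frac{1}{C}\mathbf{1}\mathbf{1}^\top)$ and checks its two defining properties: every class vector has unit norm, and every pair of distinct class vectors has cosine similarity $-1/(C-1)$. Embedding the vectors in $\mathbb{R}^C$ without an extra rotation is a simplification, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes):
    """Return a C x C matrix whose rows are simplex-ETF class vectors.

    Row i is sqrt(C / (C - 1)) * (e_i - (1 / C) * ones), where e_i is the
    i-th standard basis vector. Assumption: no rotation into a feature
    space of a different dimension is applied.
    """
    c = num_classes
    scale = math.sqrt(c / (c - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / c) for j in range(c)]
        for i in range(c)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


C = 4
W = simplex_etf(C)

# Norm of each class vector and cosine between each pair of distinct classes.
norms = [math.sqrt(sum(a * a for a in row)) for row in W]
pair_cosines = [cosine(W[i], W[j]) for i in range(C) for j in range(i + 1, C)]
```

Because all pairwise angles are equal and as large as possible, no class is geometrically favored by such a classifier, regardless of how many training samples it has.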

Figures (6)

  • Figure 1: Heatmaps showing average cosine similarities of training features between two classes. WD maintains high cosine similarity within the same class and reduces the cosine similarity between different classes.
  • Figure 2: Norm of mean per-class training features produced from models trained with each method. For every method that uses WD, the feature norms of the Many classes tend to be smaller than those of the Few classes.
  • Figure 3: Average forgetting scores per class when models are trained with each method. These indicate higher forgetting scores when the models are trained with CB; this is particularly noticeable in the Many classes without WD.
  • Figure 4: Results for mini-ImageNet-LT. (Left) Norm of mean per-class training features produced from models trained with each method. (Right) Heatmaps showing average cosine similarities of training features between two classes.
  • Figure 5: Norm ratio of mean per-class training features produced from models trained with each dataset and each method. The vertical axis shows the ratio of the weight norm of each class to that of the class with the largest sample size. Models trained with both methods have almost identical linear layer norms.
  • ...and 1 more figure
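
The two statistics plotted in Figures 1 and 2 can be sketched as follows: the norm of the mean feature of each class, and the average cosine similarity between the features of two classes. This is a minimal sketch on toy data. The helper names are hypothetical, and skipping self-pairs when both classes are the same is an assumption about how the averages are taken.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_mean_norms(features, labels):
    """Map each class id to the L2 norm of its mean feature vector."""
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)
    norms = {}
    for y, members in by_class.items():
        dim = len(members[0])
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        norms[y] = math.sqrt(sum(m * m for m in mean))
    return norms


def mean_cosine(features, labels, class_a, class_b):
    """Average cosine similarity over pairs (x in class_a, y in class_b).

    Assumption: a sample is never paired with itself when class_a == class_b.
    """
    idx_a = [i for i, y in enumerate(labels) if y == class_a]
    idx_b = [i for i, y in enumerate(labels) if y == class_b]
    sims = [
        _cosine(features[i], features[j])
        for i in idx_a
        for j in idx_b
        if i != j
    ]
    return sum(sims) / len(sims)


# Toy features: two classes pointing in nearly orthogonal directions.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

norms = class_mean_norms(features, labels)
same_class = mean_cosine(features, labels, 0, 0)
diff_class = mean_cosine(features, labels, 0, 1)
```

Evaluating `mean_cosine` for every pair of classes gives the entries of a heatmap like the one in Figure 1, with the diagonal holding the within-class averages.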

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3
  • Lemma 1
  • proof
  • Lemma 2
  • proof: Proof of Lemma \ref{lemma:old_main}
  • proof: Proof of Theorem \ref{theory:2}