Heterogeneous Learning Rate Scheduling for Neural Architecture Search on Long-Tailed Datasets

Chenxia Tang

TL;DR

This work investigates applying Differentiable Architecture Search (DARTS) to long-tailed datasets and finds that standard re-sampling and re-weighting can harm NAS performance. It introduces a heterogeneous learning-rate scheduling strategy for architecture parameters within a Bilateral Branch Network (BBN) to stabilize DARTS training on imbalanced data, coupled with a symmetric mixing-ratio scheme for the two heads. Empirical results on long-tailed CIFAR-10 show that the proposed method (HLS) can achieve accuracy comparable to or better than the DARTS baseline, while re-sampling methods consistently degrade performance. The study highlights the importance of controlling the architecture-parameter learning rate and keeping training dynamics balanced in differentiable NAS (DNAS) under class imbalance, and suggests that careful data augmentation is a critical factor in such scenarios.

Abstract

In this paper, we attempt to address the challenge of applying Neural Architecture Search (NAS) algorithms, specifically Differentiable Architecture Search (DARTS), to long-tailed datasets whose class distribution is highly imbalanced. We observe that traditional re-sampling and re-weighting techniques, which are effective in standard classification tasks, degrade performance when combined with DARTS. To mitigate this, we propose a novel adaptive learning rate scheduling strategy tailored to the architecture parameters of DARTS when it is integrated with the Bilateral Branch Network (BBN) for handling imbalanced datasets. Our approach dynamically adjusts the learning rate of the architecture parameters based on the training epoch, preventing the disruption of well-trained representations in the later stages of training. Additionally, we explore the impact of branch mixing factors on the algorithm's performance. Through extensive experiments on the CIFAR-10 dataset with an artificially induced long-tailed distribution, we demonstrate that our method achieves accuracy comparable to using DARTS alone. The experimental results also suggest that re-sampling methods inherently harm the performance of the DARTS algorithm. Our findings highlight the importance of careful data augmentation when applying differentiable NAS (DNAS) to imbalanced learning scenarios.
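As a rough illustration of what an "artificially induced long-tailed distribution" on CIFAR-10 can look like, the sketch below computes per-class sample counts under an exponential decay profile. The abstract does not state which profile or imbalance factor the paper uses, so the function name `long_tailed_counts`, the exponential form, and the default `imbalance_factor=100` are all assumptions made for illustration only.

```python
def long_tailed_counts(n_max=5000, num_classes=10, imbalance_factor=100):
    """Per-class sample counts under an exponential long-tail profile.

    Assumption (not taken from the paper): class i keeps
    n_max * (1 / imbalance_factor) ** (i / (num_classes - 1)) samples,
    so the largest class is imbalance_factor times the smallest one.
    """
    counts = []
    for i in range(num_classes):
        # Fraction of the original class that is kept; 1.0 for class 0,
        # 1 / imbalance_factor for the last class.
        frac = (1.0 / imbalance_factor) ** (i / (num_classes - 1))
        counts.append(int(n_max * frac))
    return counts


if __name__ == "__main__":
    # CIFAR-10 has 5000 training images per class before subsampling.
    print(long_tailed_counts())
```

Under these assumed settings the head class keeps all of its training images while the tail class keeps only a small fraction, which is the regime in which re-sampling and re-weighting are usually applied.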

Paper Structure (23 sections, 16 equations, 8 figures, 2 tables, 1 algorithm)

Figures (8)

  • Figure 1: The architecture of BBN. Samples drawn from two different data sources are fed simultaneously into the backbone. The resulting hidden vectors are passed to two classification heads, whose outputs are mixed according to $\mu$ to obtain the mixed probability vector. Gradients flow along the reverse path, so each component receives updates of a different magnitude.
  • Figure 2: Training curves when BBN and DARTS are simply combined. The training loss first decreases, then rises unexpectedly, and then drops rapidly. The validation loss, by contrast, stays roughly unchanged while the training loss rises, and validation performance starts to decline once the training loss falls. This pattern would typically suggest overfitting, but judging from the training process, the class-sampling classifier head should in fact be underfitting.
  • Figure 3: Visualization of the architecture as training progresses. From top to bottom: backbone normal cell, backbone reduction cell, instance-sampling head, and class-sampling head. The backbone does not converge stably even by the end of training. The instance-sampling head follows the mixing ratio and shows no updates in the later stages of training; the class-sampling head exhibits the opposite behavior.
  • Figure 4: Accuracy versus mixing ratio
  • Figure 5: Weight versus normalized epoch. Ins Weight is identical to the mixing ratio $\mu$.
  • ...and 3 more figures
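
The mixing step described in the captions for Figures 1 and 5 can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes that the two heads' logits are mixed before the softmax and that $\mu$ follows a parabolic decay in the normalized epoch, $\mu = 1 - (t/T)^2$. Neither detail is stated in the captions, and the names `mixing_ratio` and `mixed_probabilities` are hypothetical.

```python
import math


def mixing_ratio(epoch, total_epochs):
    """Assumed schedule: mu = 1 - (t/T)^2, decaying from 1 to 0."""
    t = epoch / total_epochs  # normalized epoch in [0, 1]
    return 1.0 - t * t


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_probabilities(ins_logits, cls_logits, mu):
    """Mix instance-sampling and class-sampling head logits with weight mu.

    Assumption: mixing happens on logits, followed by a single softmax.
    """
    mixed = [mu * a + (1.0 - mu) * b for a, b in zip(ins_logits, cls_logits)]
    return softmax(mixed)


if __name__ == "__main__":
    ins = [2.0, 0.5, -1.0]   # toy logits from the instance-sampling head
    cls = [0.0, 1.5, 1.0]    # toy logits from the class-sampling head
    for epoch in (0, 50, 100):
        mu = mixing_ratio(epoch, total_epochs=100)
        print(epoch, round(mu, 3), mixed_probabilities(ins, cls, mu))
```

With this form of mixing, the gradient reaching the instance-sampling head is scaled by $\mu$ and the gradient reaching the class-sampling head by $1 - \mu$. Under the assumed schedule, the instance-sampling head therefore receives almost no signal late in training, which is consistent with the behavior described for Figure 3.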

Theorems & Definitions (1)

  • Proof 1