FastFace: Fast-converging Scheduler for Large-scale Face Recognition Training with One GPU

Xueyuan Gong, Zhiquan Liu, Yain-Whar Si, Xiaochen Yuan, Ke Wang, Xiaoxiang Liu, Cong Lin, Xinyuan Zhang

TL;DR

FastFace is a fast-converging learning rate (LR) scheduler for large-scale face recognition (FR) training that costs $O(1)$ per iteration. It combines Exponential Moving Average (EMA) smoothing with a Haar Convolutional Kernel (HCK) to detect stagnation in the loss, then applies a recurrence-based LR adjustment to leave stationary phases promptly. The method cuts training from the typical $20$ epochs to about $5$ on a single GPU, with an accuracy loss of less than $1\%$, which makes large-scale FR training feasible without extensive hardware. Experiments on MS1MV2/3 and WebFace variants show substantial efficiency gains while accuracy stays competitive with state-of-the-art multi-GPU approaches.

Abstract

Computing power has evolved into a foundational and indispensable resource in the area of deep learning, particularly in tasks such as Face Recognition (FR) model training on large-scale datasets, where multiple GPUs are often a necessity. Recognizing this challenge, some FR methods have started exploring ways to compress the fully-connected layer in FR models. Unlike other approaches, our observations reveal that without prompt scheduling of the learning rate (LR) during FR model training, the loss curve tends to exhibit numerous stationary subsequences. To address this issue, we introduce a novel LR scheduler leveraging Exponential Moving Average (EMA) and Haar Convolutional Kernel (HCK) to eliminate stationary subsequences, resulting in a significant reduction in converging time. However, the proposed scheduler incurs a considerable computational overhead due to its time complexity. To overcome this limitation, we propose FastFace, a fast-converging scheduler with negligible time complexity, i.e. O(1) per iteration, during training. In practice, FastFace is able to accelerate FR model training to a quarter of its original time without sacrificing more than 1% accuracy, making large-scale FR training feasible even with just one single GPU in terms of both time and space complexity. Extensive experiments validate the efficiency and effectiveness of FastFace. The code is publicly available at: https://github.com/amoonfana/FastFace
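
The two building blocks named in the abstract, EMA smoothing and a Haar-style kernel, can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the smoothing factor `beta`, and the windowed form of the kernel are all hypothetical choices made here for clarity.

```python
def ema_update(prev, value, beta=0.99):
    """One EMA step in O(1) time. `beta` is an assumed smoothing factor."""
    if prev is None:  # first observation: nothing to smooth against yet
        return value
    return beta * prev + (1.0 - beta) * value


def haar_response(window):
    """Haar-like step response over an even-length window (assumed form).

    Returns mean(older half) - mean(newer half): positive when the loss
    is declining, close to zero when the window is stationary.
    """
    half = len(window) // 2
    older = sum(window[:half]) / half
    newer = sum(window[-half:]) / half
    return older - newer


# A flat window gives no response; a step down gives a positive one.
flat = haar_response([1.0, 1.0, 1.0, 1.0])
step_down = haar_response([4.0, 4.0, 2.0, 2.0])
```

Evaluating the kernel over a window like this costs time proportional to the window length at every step, which is the kind of overhead the abstract says FastFace avoids with its $O(1)$ per-iteration formulation.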

Paper Structure (23 sections, 13 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Loss curve for training ResNet100 on MS1MV3. All curves are normalized to $[0,1]$ so that they fit in one figure. With FastFace, the loss after $2\times10^{5}$ training steps is close to the loss after $8\times10^{5}$ steps without it, so only $1/4$ of the original training time is required. This makes it feasible to train models on a single GPU while preserving accuracy.
  • Figure 2: Smoothing the loss curve with EMA. As shown in (a), the original loss curve is hard to analyze because it contains too much noise, so we smooth it with EMA, as shown in (b). All curves are normalized.
  • Figure 3: Examples of $\mathbf{D}$ and $\mathbf{D}^{E}$. As shown in (a), it is hard to see a relationship between $\mathbf{L}$ and $\mathbf{D}$. In (b), $\mathbf{D}$ is smoothed by EMA and denoted $\mathbf{D}^{E}$; a spike clearly appears on $\mathbf{D}^{E}$ whenever $\mathbf{L}$ declines. All curves are normalized.
  • Figure 4: Grid search over the initial learning rate $\gamma_{0}$ and decay factor $\delta$. Panel (a) shows a heatmap over $\gamma_{0}$ and $\delta$, with TAR@FAR$=10^{-4}$ on IJB-C reported as the accuracy ($\%$). Panel (b) illustrates the impact of the decay factor, with $\delta=8$ for $\mathbf{\Gamma}_{1}$ and $\delta=2$ for $\mathbf{\Gamma}_{2}$: $\mathbf{\Gamma}_{1}$ declines quickly, and $\mathbf{L}^{E}_{1}$ becomes stationary early.
  • Figure 5: Grid search over the threshold $\lambda$ and tolerance $\tau$ in FastFace. Panel (a) shows a heatmap with TAR@FAR$=10^{-4}$ on IJB-C reported as the accuracy ($\%$). Panel (b) illustrates the relationship between different values of $\lambda$ and $\mathcal{D}^{E}_{t}$: the portions of the $\mathcal{D}^{E}_{t}$ curve below $\lambda$ are treated as the signal to schedule the learning rate.
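
The captions above name four hyperparameters: the initial learning rate $\gamma_{0}$, the decay factor $\delta$, the threshold $\lambda$, and the tolerance $\tau$. The sketch below shows one plausible way they could interact. Every detail is an assumption made for illustration: the class name, the use of a first difference of the smoothed loss as a stand-in for $\mathbf{D}$, the reading of $\tau$ as a count of consecutive below-threshold steps, and the rule of dividing the learning rate by $\delta$ are not taken from the paper.

```python
class StagnationScheduler:
    """Hypothetical O(1)-per-step scheduler driven by a smoothed loss-drop signal."""

    def __init__(self, lr0, decay, threshold, tolerance, beta=0.9):
        self.lr = lr0               # gamma_0: initial learning rate
        self.decay = decay          # delta: assumed divisor applied to the LR
        self.threshold = threshold  # lambda: signal level treated as "stationary"
        self.tolerance = tolerance  # tau: assumed number of consecutive low steps
        self.beta = beta            # assumed EMA smoothing factor
        self.loss_ema = None        # smoothed loss, analogous to L^E
        self.drop_ema = 0.0         # smoothed decline signal, analogous to D^E
        self.below = 0              # consecutive steps with drop_ema < threshold

    def step(self, loss):
        if self.loss_ema is None:   # first call only initializes the EMA
            self.loss_ema = loss
            return self.lr
        prev = self.loss_ema
        self.loss_ema = self.beta * prev + (1.0 - self.beta) * loss
        drop = prev - self.loss_ema  # positive while the smoothed loss declines
        self.drop_ema = self.beta * self.drop_ema + (1.0 - self.beta) * drop
        self.below = self.below + 1 if self.drop_ema < self.threshold else 0
        if self.below >= self.tolerance:  # stationary long enough: decay the LR
            self.lr /= self.decay
            self.below = 0
        return self.lr


# A constant loss is stationary, so the LR is cut after `tolerance` low steps.
sched = StagnationScheduler(lr0=0.1, decay=2.0, threshold=0.01, tolerance=3, beta=0.5)
lrs = [sched.step(1.0) for _ in range(4)]
```

Under these assumptions a larger $\delta$ shrinks the learning rate faster, which is consistent with the Figure 4 observation that $\mathbf{\Gamma}_{1}$ (with $\delta=8$) declines quickly.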