Long-Tailed Recognition on Binary Networks by Calibrating A Pre-trained Model

Jihun Kim, Dahyun Kim, Hyungrok Jung, Taeil Oh, Jonghyun Choi

TL;DR

This work tackles long-tailed recognition on resource-constrained binary neural networks by introducing Calibrate and Distill (CANDLE). A pretrained full-precision teacher is calibrated on target LT data and used to distill supervision into a binary student, with an adversarially learned balancing of distillation terms and an efficient multiresolution learning scheme to generalize across datasets. The approach yields large improvements over prior LT methods across 15 benchmarks, especially boosting tail-class accuracy, while maintaining computational efficiency suitable for edge deployment. The results demonstrate that distillation from a fixed FP teacher, when combined with dataset-aware balancing and multiresolution calibration, provides a scalable path to accurate LT recognition on binary networks. Limitations include dependency on non-LT pretrained teachers; future work could explore LT-pretrained teachers to further enhance performance and fairness across classes.

Abstract

Deploying deep models in real-world scenarios entails a number of challenges, including computational efficiency and real-world (e.g., long-tailed) data distributions. We address the combined challenge of learning long-tailed distributions using highly resource-efficient binary neural networks as backbones. Specifically, we propose a calibrate-and-distill framework that uses off-the-shelf full-precision models pretrained on balanced datasets as teachers for distillation when learning binary networks on long-tailed datasets. To better generalize to various datasets, we further propose a novel adversarial balancing among the terms in the objective function and an efficient multiresolution learning scheme. We conducted the largest empirical study in the literature using 15 datasets, including long-tailed datasets newly derived from existing balanced datasets, and show that our proposed method outperforms prior art by large margins (>14.33% on average).
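To make the idea of a distillation objective with balanced terms concrete, the following is a minimal sketch, not the authors' implementation. It combines a KL term on teacher and student logits with a feature-matching term, and mixes them with a single scalar. Everything specific here is an assumption made for illustration: the function names, the use of cosine distance for the feature term, the temperature, and the fixed `balance` scalar, which merely stands in for the weighting that the paper learns adversarially.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity; an assumed choice for the feature term."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def distillation_loss(teacher_logits, student_logits,
                      teacher_feat, student_feat,
                      balance=0.5, temperature=1.0):
    """Weighted sum of a logit-level KL term and a feature-level term.

    `balance` is a hypothetical fixed scalar in [0, 1]; the paper instead
    learns the balance between terms with an adversarial network.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    loss_kl = kl_divergence(p_teacher, p_student)
    loss_feat = cosine_distance(teacher_feat, student_feat)
    return balance * loss_kl + (1.0 - balance) * loss_feat
```

A student that reproduces the teacher's logits and features exactly gets a loss close to zero under this sketch, and the loss grows as either the predicted distribution or the feature direction drifts from the teacher's.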

Paper Structure (36 sections, 5 equations, 10 figures, 10 tables, 2 algorithms)

Figures (10)

  • Figure 1: Overview of the proposed CANDLE. Learning progresses from ① to ③. The thick blue arrows (②) indicate copying the components of the teacher network (used frozen). In calibration, the pretrained FP teacher encoder is frozen (denoted by the lock icon) and the remaining components are calibrated with multi-resolution inputs (colored volumetric cubes for each resolution's feature map, 'Multi-Res. Feat. Map'). In distillation, we compute $\mathcal{L}_{FS}$ from one of the calibrated teacher's multi-res. pooled features (Ⓐ in the black rectangle) and $\mathcal{L}_{KL}$ from the calibrated teacher's logits. The purple box ($AB_{\phi}$) is the adversarial balancing network (Sec. \ref{sec:datadriven}).
  • Figure 2: Comparison of computational (blue bars) and memory (orange bars) costs between directly using multi-resolution inputs and our method of using them only for the teacher ('Our Multi-Res'), in terms of peak VRAM (MiB) and training time for one epoch (sec.).
  • Figure 3: Classifier weight norm of binary networks trained from scratch on CIFAR-100 (imbalance ratio: 100) using SGD with or without weight decay. We observe similar trends for the class weight norms as in \cite{kang2019decoupling, alshammari2022long}, where the class weight norms become smaller for the tail classes. (See discussion in Sec. \ref{sec:methodlogy}.)
  • Figure 4: Classifier weight norms of binary and FP models trained with Adam without weight decay on CIFAR-100 (imbalance ratio: 100). (a) The classifier weight norms increase at the tail classes. When trained from scratch, binary networks show larger differences of weight norms than FP, as indicated by the larger slope of the curves. (b) Using CANDLE, the magnitudes of the weights of both binary and FP networks vary less, showing smaller differences across classes.
  • Figure 5: Per-class accuracy gain of CANDLE over the baseline on ImageNet-LT. Our method improves the accuracy at the tail classes by $+22.65\%$, showing its effectiveness for LT. Furthermore, the proposed method also shows gains in the accuracy at the head and medium classes by $+10.57\%$ and $+26.17\%$ respectively.
  • ...and 5 more figures
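
The classifier weight norms plotted in Figures 3 and 4 can be read as one scalar per class. As a minimal sketch, assuming the norm is the L2 norm of each class's row in the final linear layer (the captions do not state which norm is used), the per-class values and a simple head-to-tail spread could be computed as follows. The helper names `class_weight_norms` and `norm_spread` are hypothetical and not taken from the paper.

```python
import math


def class_weight_norms(weight_rows):
    """L2 norm of each class's weight vector (one row per class)."""
    return [math.sqrt(sum(w * w for w in row)) for row in weight_rows]


def norm_spread(norms):
    """Ratio of the largest to the smallest per-class norm.

    A value near 1 means the norms are roughly uniform across classes;
    a larger value means the norms differ more between classes.
    """
    return max(norms) / min(norms)


# Toy classifier with two classes and two-dimensional features.
toy_weights = [[3.0, 4.0], [0.0, 1.0]]
toy_norms = class_weight_norms(toy_weights)
toy_spread = norm_spread(toy_norms)
```

Sorting classes from head to tail before plotting these norms gives curves of the kind the captions describe, where a steeper slope corresponds to a larger spread.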