
Progressive Multi-task Anti-Noise Learning and Distilling Frameworks for Fine-grained Vehicle Recognition

Dichao Liu

TL;DR

This work tackles fine-grained vehicle recognition (FGVR) under image noise with two frameworks: Progressive Multi-task Anti-noise Learning (PMAL), which adds denoising as an auxiliary task through a Denoising-recognition Head (DRH), and Progressive Multi-task Distilling (PMD), which transfers the robustness learned by PMAL to a standard backbone. PMAL trains multiple DRHs attached at shallow-to-deep layers to learn noise-invariant features, while PMD uses a teacher-student distillation paradigm whose guidance progresses from intermediate features to final predictions, aided by Sharpness-Aware Minimization (SAM). Experiments on Stanford Cars, CompCars, BIT-Vehicle, VTID2, and VIDMMR show sizable accuracy gains over state-of-the-art FGVR methods at no extra inference cost, including 100% accuracy on VIDMMR in some PMD configurations. The approach is practical for robust FGVR in noisy real-world intelligent transportation system (ITS) and surveillance scenarios, because it delivers high accuracy and noise-resilient representations with the original backbone architecture.

Abstract

Fine-grained vehicle recognition (FGVR) is a fundamental technology for intelligent transportation systems, but it is very difficult because of its inherent intra-class variation. Most previous FGVR studies focus only on the intra-class variation caused by different shooting angles, positions, etc., while the intra-class variation caused by image noise has received little attention. This paper proposes a progressive multi-task anti-noise learning (PMAL) framework and a progressive multi-task distilling (PMD) framework to address the intra-class variation in FGVR caused by image noise. The PMAL framework achieves high recognition accuracy by treating image denoising as an additional task in image recognition and progressively forcing a model to learn noise invariance. The PMD framework transfers the knowledge of the PMAL-trained model into the original backbone network, producing a model with about the same recognition accuracy as the PMAL-trained model but without any additional overhead over the original backbone network. Combining the two frameworks, we obtain models that significantly exceed previous state-of-the-art methods in recognition accuracy on two widely used, standard FGVR datasets, namely Stanford Cars and CompCars, as well as on three additional surveillance image-based vehicle-type classification datasets, namely Beijing Institute of Technology (BIT)-Vehicle, Vehicle Type Image Data 2 (VTID2), and Vehicle Images Dataset for Make Model Recognition (VIDMMR), without any additional overhead over the original backbone networks. The source code is available at https://github.com/Dichao-Liu/Anti-noise_FGVR
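
The idea of treating denoising as an additional task alongside recognition can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the use of mean squared error for the denoising term, the function names, and the `denoise_weight` balancing factor are all assumptions made for illustration, and the paper's actual losses and weighting may differ.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def mse(pred, target):
    """Mean squared error between two flat lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(logits, label, denoised, clean, denoise_weight=1.0):
    """Recognition loss plus a weighted denoising loss.

    denoise_weight is a hypothetical balancing factor, not a value
    taken from the paper.
    """
    return cross_entropy(logits, label) + denoise_weight * mse(denoised, clean)


# Toy example: two classes, a two-pixel "image".
loss = multitask_loss(
    logits=[2.0, 0.5], label=0,
    denoised=[0.48, 0.52], clean=[0.50, 0.50],
)
print(f"combined loss: {loss:.4f}")
```

In this sketch a single shared feature extractor would produce both `logits` and `denoised`, so gradients from the denoising term push the shared features toward noise invariance while the recognition term keeps them discriminative.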

Paper Structure (14 sections, 10 equations, 6 figures, 7 tables, 2 algorithms)

This paper contains 14 sections, 10 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Motivation. Convolutional neural networks (CNNs) are highly susceptible to noise perturbations in the fine-grained vehicle recognition task. We add very slight random normal noise (standard deviation 0.01) to the testing images (normalized to between 0 and 1) from the Stanford Cars dataset (Krause et al., 2013) and use a ResNet50 (He et al., 2016) trained on the original training images of this dataset to recognize the original and the noisy testing images, respectively. To the human eye, there is almost no difference between the original image and the noisy image, but the CNN assigns the noisy images to the wrong vehicle models. In real-world intelligent transportation system applications, the captured vehicle images are commonly affected by image noise. This paper focuses on addressing the intra-class variation caused by image noise.
  • Figure 2: Illustration of the architecture of the Denoising-recognition Head (DRH). DRH operates on the intermediate feature from a certain layer of the backbone CNN, which is fed a noisy image. DRH consists of a recognition sub-head and a denoising sub-head. The recognition sub-head takes the intermediate feature as input and predicts the vehicle model. The denoising sub-head takes the intermediate feature and the noisy image as input and restores a clean image. In this figure, Conv, ReLU, BN, GMP, FC, ELU, and PS are abbreviations for convolution, rectified linear unit, batch normalization, global maximum pooling, fully-connected, exponential linear unit, and pixelshuffle layers, respectively. Convolution and fully-connected layers are represented by their filter sizes, e.g., $[C,\frac{D}{2},1\times1]$ represents a convolution layer with $C$ input channels, $\frac{D}{2}$ output channels, and a spatial kernel size of $1\times1$, and $[D,\frac{D}{2}]$ represents a fully-connected layer with $D$ input neurons and $\frac{D}{2}$ output neurons. In all convolution layers of DRH, the stride is set to 1, and padding is applied so that the output keeps the same spatial size as the input. Feature maps are represented by "number of channels" $\times$ "height" $\times$ "width". One-dimensional descriptors are represented by their number of neurons. Pixelshuffle layers are represented by their upsampling scale. $D$, $D'$, and $D''$ are manually set hyperparameters that control the number of channels in the convolution and fully-connected layers.
  • Figure 3: Illustration of the progressive multi-task distilling framework. Both the teacher and student networks have $J$ stages, and three DRHs are installed in the teacher network. Different training steps are illustrated in different colors.
  • Figure 4: In the experiments of this work, we construct three DRHs from shallow to deep layers. This figure shows the histograms of the weights associated with the first, second, and third DRHs, respectively. The backbone is a ResNet50 pre-trained on ImageNet (Deng et al., 2009). The three weight histograms have similar distributions, and their L2 norms are of the same order of magnitude. We therefore use the same $\rho$ for the different DRHs in the experiments.
  • Figure 5: Denoised results. Each group of five images, from left to right, shows the original image, the noisy image, and the noise reduction results derived from $S_{\rm den}^1$, $S_{\rm den}^2$, and $S_{\rm den}^3$. Given noisy images, $S_{\rm den}^1$, $S_{\rm den}^2$, and $S_{\rm den}^3$ can generate smooth and noise-free images.
  • ...and 1 more figure
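
The noise perturbation described in the Figure 1 caption (zero-mean normal noise with standard deviation 0.01 added to images normalized to between 0 and 1) can be sketched as follows. This is a dependency-free illustration, not the authors' code: the function name `add_gaussian_noise`, the nested-list image representation, and the `seed` parameter are assumptions. The caption does not say whether noisy values are clipped back to the 0 to 1 range, so no clipping is applied here.

```python
import random


def add_gaussian_noise(image, std=0.01, seed=None):
    """Return a copy of `image` with zero-mean normal noise added per pixel.

    `image` is a list of rows of floats normalized to [0, 1].
    `seed` is a hypothetical convenience parameter for reproducibility.
    Values are not clipped, because the source does not specify clipping.
    """
    rng = random.Random(seed)
    return [[px + rng.gauss(0.0, std) for px in row] for row in image]


# Toy 2x3 grayscale "image" with values in [0, 1].
clean = [[0.10, 0.50, 0.90],
         [0.30, 0.70, 0.20]]
noisy = add_gaussian_noise(clean, std=0.01, seed=0)

# The perturbation is tiny relative to the pixel range.
max_diff = max(abs(n - c)
               for n_row, c_row in zip(noisy, clean)
               for n, c in zip(n_row, c_row))
print(f"largest per-pixel change: {max_diff:.4f}")
```

Feeding both `clean` and `noisy` versions of a test set to the same trained classifier and comparing accuracy reproduces the kind of comparison the caption describes.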