Weakly-supervised Part-Attention and Mentored Networks for Vehicle Re-Identification

Lisha Tang, Yi Wang, Lap-Pui Chau

TL;DR

This paper proposes a weakly-supervised Part-Attention Network (PANet) and Part-Mentored Network (PMNet) for Vehicle Re-ID and adopts Homoscedastic Uncertainty to learn the optimal weighting of ID losses.

Abstract

Vehicle re-identification (Re-ID) aims to retrieve images with the same vehicle ID across different cameras. Current part-level feature learning methods typically detect vehicle parts via uniform division, external tools, or attention modeling. However, such part features often require expensive additional annotations and lead to sub-optimal performance when part mask predictions are unreliable. In this paper, we propose a weakly-supervised Part-Attention Network (PANet) and Part-Mentored Network (PMNet) for Vehicle Re-ID. Firstly, PANet localizes vehicle parts via part-relevant channel recalibration and cluster-based mask generation without vehicle part supervisory information. Secondly, PMNet leverages teacher-student guided learning to distill vehicle part-specific features from PANet and performs multi-scale global-part feature extraction. During inference, PMNet can adaptively extract discriminative part features without part localization by PANet, which avoids unstable part mask predictions. We treat Re-ID as a multi-task problem and adopt Homoscedastic Uncertainty to learn the optimal weighting of ID losses. Experiments on two public benchmarks show that our approach outperforms recent methods that require no extra annotations, by an average increase of 3.0% in CMC@5 on VehicleID and over 1.4% in mAP on VeRi776. Moreover, our method can extend to the occluded vehicle Re-ID task and exhibits good generalization ability.
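
The loss weighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the form below is an assumption: a common simplification of homoscedastic-uncertainty weighting in which each task loss $L_i$ gets a learnable scalar $s_i = \log \sigma_i^2$ and the combined objective is $\sum_i e^{-s_i} L_i + s_i$. The function names `uncertainty_weighted_loss` and `optimal_log_var` are hypothetical and chosen only for illustration.

```python
import math


def uncertainty_weighted_loss(losses, log_vars):
    """Combine task losses using learned homoscedastic-uncertainty weights.

    Assumed form (not taken from the paper): sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a learnable scalar per task.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


def optimal_log_var(loss):
    """Minimizer of exp(-s) * L + s over s for a fixed loss L > 0.

    Setting the derivative -exp(-s) * L + 1 to zero gives s = log(L),
    so the effective weight exp(-s) equals 1 / L.
    """
    return math.log(loss)


# Two hypothetical ID losses: one large, one small.
id_losses = [2.0, 0.5]
equal_weights = uncertainty_weighted_loss(id_losses, [0.0, 0.0])
learned = [optimal_log_var(loss) for loss in id_losses]
balanced = uncertainty_weighted_loss(id_losses, learned)
```

Under this assumed form, a task with a larger loss ends up with a smaller weight $e^{-s_i}$, while the additive $s_i$ term keeps the weights from collapsing to zero. In training, the $s_i$ values would be optimized by gradient descent together with the network parameters rather than set in closed form.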

Paper Structure

This paper contains 33 sections, 8 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a) The two images in the first row show the large intra-class variance of the same vehicle caused by complex variations in illumination, image quality, viewpoint, and background clutter, whilst the images in the second row illustrate the minor inter-class discrepancy between near-duplicate vehicles. (b) Four zoomed-in patches show that the same spatial position across images may correspond to different vehicle parts due to occlusion, viewpoint changes, and diverse spatial distributions. (c) The first row shows a sample image and the $K=3$ part masks produced by our PANet; the second row shows four view-aware masks generated by PVEN.
  • Figure 2: Architecture of the proposed Part-Attention Network (PANet) and Part-Mentored Network (PMNet). (a) PANet localizes $K$ vehicle parts via the part-relevant channel recalibration and cluster-based mask generation modules. (b) To learn robust part-level features, PMNet applies the part masks from PANet to $K$ streams, each consisting of a Student branch and a Teacher branch. The part-level knowledge learnt by each Teacher branch is transferred to the corresponding Student branch, so that only the Student branch is retained during inference and no prior part masks from PANet are needed. The part feature learning branch then cooperates with the global feature learning branch to perform global-part feature extraction. Multi-task learning is applied to train PMNet. For simplicity, the ReLU and BatchNorm layers following each Conv layer are omitted.
  • Figure 3: Architecture of the Part-Attention Network (PANet). As the upper sub-figure illustrates, PANet consists of a ResNet-50 backbone, a classification branch for Re-ID, and a segmentation branch. PANet is supervised by an ID loss and a Mean Square Error loss during training. The BatchNorm applied to the fully connected (FC) layer is omitted for simplicity. The lower sub-figure shows the pipeline that predicts $K$ dense vehicle part masks during testing; details of the Part-relevant Channel Recalibration module are shown in the orange rectangle. $\mathcal{S}(\cdot)$ is the Softmax normalization.
  • Figure 4: Input generation for Teacher branches. The dashed block shows details of the Input Generation module in Fig. \ref{Fig:PGLN}. Here we set $K=3$ in our experiments and show visualization maps of the three inputs for the three Teacher branches. These maps respectively focus on three different vehicle parts with semantic meanings, i.e., vehicle roof, windscreen, and headlights.
  • Figure 5: Structure of the Multi-scale Attention Module. (a) Similar to CBAM, the channel block feeds both Global Max-Pooling (GMP) and Global Average-Pooling (GAP) outputs through a shared Multi-Layer Perceptron (MLP) to obtain the channel mask, $C^k$. (b) In the spatial block, $P^k$ is first fed into three $3\times3$ Conv layers with different dilation rates (1, 2, and 3) to mine multi-scale features. The output then goes through three Conv layers and a Sigmoid function to obtain the spatial mask, $S^k$. The three items in the brackets of a Conv layer are the filter number, filter shape, and stride, respectively. The ReLU and Batch Normalization applied to each Conv layer are not shown for brevity.
  • ...and 4 more figures
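
The two blocks described in the Figure 5 caption can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the caption only says the channel block is "similar to CBAM", so the sketch follows the CBAM convention of summing the two shared-MLP outputs before a Sigmoid, and the weight matrices are hypothetical toy values. The helper `effective_kernel_size` uses the standard formula $k + (k-1)(d-1)$ to show why dilation rates 1, 2, and 3 give three different scales.

```python
import math


def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer MLP with a ReLU in between.

    w1 has shape (hidden, C) and w2 has shape (C, hidden), as nested lists.
    """
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_mask(feature, w1, w2):
    """CBAM-style channel mask for a feature map given as C lists of H x W.

    Assumption: the GAP and GMP descriptors pass through the same MLP and
    their outputs are summed before the Sigmoid, as in CBAM.
    """
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    gmp = [max(max(row) for row in ch) for ch in feature]
    from_gap = shared_mlp(gap, w1, w2)
    from_gmp = shared_mlp(gmp, w1, w2)
    return [sigmoid(a + m) for a, m in zip(from_gap, from_gmp)]


# Dilation rates 1, 2, 3 on a 3x3 kernel cover 3x3, 5x5, and 7x7 windows.
scales = [effective_kernel_size(3, d) for d in (1, 2, 3)]

# Toy input: 2 channels of size 2x2, with hypothetical MLP weights.
toy_feature = [[[1.0, 3.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
toy_w1 = [[1.0, 0.0]]    # hidden size 1
toy_w2 = [[1.0], [0.0]]  # back to 2 channels
mask = channel_mask(toy_feature, toy_w1, toy_w2)
```

Each entry of `mask` lies in (0, 1) and would rescale the corresponding channel of $P^k$. In a real network the MLP weights are learned and the pooling would run on tensors rather than nested lists.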