Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification

Mei Qiu, Lauren Christopher, Lingxi Li

TL;DR

This work addresses the sensitivity of Vision Transformer–based vehicle Re-ID to input aspect ratios. It introduces a multi-aspect-ratio ViT framework with uneven patch strides, intra-image Patch Mixing guided by spatial attention, and dynamic feature fusion at inference to fuse information from models trained on different aspect ratios. The approach yields substantial gains, achieving up to $mAP=91.0\%$ on VehicleID and outperforming several state-of-the-art methods, while also improving robustness to aspect-ratio variations on VeRi-776. The practical impact lies in enhancing Re-ID reliability across diverse imaging conditions without changing the underlying ViT architecture, at the cost of increased inference time that could be mitigated by pruning or selective patching.

Abstract

Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video input might significantly affect re-identification performance. To address this issue, we propose a novel ViT-based ReID framework that fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze aspect ratio performance on the VeRi-776 and VehicleID datasets, guiding input settings based on the aspect ratios of the original images. (ii) We introduce intra-image patch-wise mixup during ViT patchification (guided by spatial attention scores) and implement uneven stride for better object aspect ratio matching. (iii) We propose a dynamic feature fusion ReID network, enhancing model robustness. Our ReID method achieves a significantly improved mean Average Precision (mAP) of 91.0\%, compared to the closest state-of-the-art (CAL) result of 80.9\%, on the VehicleID dataset.
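
The "uneven stride" idea in contribution (ii) can be illustrated with a small sketch. It assumes the standard sliding-window count, floor((size - patch) / stride) + 1 per axis; the function name `patch_grid`, the 16-pixel patch, and the example strides and image sizes are illustrative assumptions, not settings taken from the paper.

```python
def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols) of the patch grid for a sliding-window patchifier.

    Assumption: no padding, so each axis yields
    floor((size - patch) / stride) + 1 patches.
    """
    if height < patch or width < patch:
        raise ValueError("image must be at least one patch in each dimension")
    if stride_h <= 0 or stride_w <= 0:
        raise ValueError("strides must be positive")
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


# Square input, non-overlapping patches: a 16 x 16 grid.
square = patch_grid(256, 256)

# Wide input with a smaller horizontal stride (hypothetical values):
# patches overlap horizontally, so the wide axis is sampled more densely.
wide = patch_grid(192, 320, stride_h=16, stride_w=12)

print(square, wide)  # (16, 16) (12, 26)
```

Choosing the two strides independently changes how many tokens cover each axis, which is one way a fixed-size patch can be matched to a non-square object without altering the transformer itself.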

Paper Structure

This paper contains 8 sections, 8 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: The aspect ratios of training images in the ReID benchmark datasets VeRi-776 and VehicleID vary significantly, and a substantial portion of the images in both datasets are non-square.
  • Figure 2: (Left) Existing Method: The input image is typically resized to a single fixed, square size. (Right) Our Method: A combined Vision Transformer (ViT)-based ReID model that dynamically fuses features extracted from multiple models, each trained on a fixed size and aspect ratio.
  • Figure 3: The structure of each individual model is designed to adapt to the dataset's size and aspect ratio distribution. During the patchification process, the stride size in the horizontal and vertical directions is dynamically determined based on the input object's aspect ratio. Subsequently, a Patch Mixing (PM) module shuffles and mixes patches from the same image using an attention-guided strategy. Any Vision Transformer (ViT)-based architecture can be chosen as the backbone. In this study, we select ViT/B-16. The features extracted from ViTs are used for the vehicle ReID downstream task.
  • Figure 4: Displayed are examples from the VehicleID (first two columns) and VeRi-776 (last two columns) test datasets, illustrating the effects of the intra-image patch mixup (PM) data augmentation method. This technique blends various parts of an image based on attention-driven distances, increasing image complexity to enhance model robustness and reduce overfitting. The top row presents images without the PM module, while the bottom row features images processed with the PM module.
  • Figure 5: ReID Performance of Square Input on VeRi-776.
  • ...and 4 more figures
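
The fusion of per-model features described for Figure 2 can be sketched as follows. This excerpt does not give the paper's fusion rule, so the sketch swaps in a simple stand-in: L2-normalize each model's feature vector, take a weighted average, and renormalize. The function names and the weights are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(features, weights):
    """Fuse one feature vector per model into a single unit-length vector.

    Assumption: weighted average of L2-normalized features. The weights
    are placeholders, not the paper's dynamic fusion scheme.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must share one dimension")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * dim
    for feat, w in zip(features, weights):
        for i, x in enumerate(l2_normalize(feat)):
            fused[i] += (w / total) * x
    return l2_normalize(fused)


def cosine_similarity(a, b):
    """Cosine similarity, as commonly used to rank gallery images in ReID."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Toy features from two hypothetical models trained on different aspect ratios.
query = fuse_features([[3.0, 4.0], [4.0, 3.0]], weights=[0.5, 0.5])
gallery = fuse_features([[3.0, 4.0], [3.0, 4.0]], weights=[0.5, 0.5])
print(round(cosine_similarity(query, gallery), 4))
```

Normalizing before averaging keeps a model with large-magnitude features from dominating the fused descriptor; a learned or input-dependent weighting would replace the fixed `weights` list.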