Adaptive Aspect Ratios with Patch-Mixup-ViT-based Vehicle ReID

Mei Qiu, Lauren Ann Christopher, Stanley Chien, Lingxi Li

TL;DR

A novel, human-perception-driven, general ViT-based ReID framework that fuses models trained on various aspect ratios. It outperforms state-of-the-art transformer-based approaches on the VeRi-776 and VehicleID datasets, with only a minimal increase in inference time per image.

Abstract

Vision Transformers (ViTs) have shown exceptional performance in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs can negatively impact re-identification accuracy. To address this challenge, we propose a novel, human-perception-driven, general ViT-based ReID framework that fuses models trained on various aspect ratios. Our key contributions are threefold: (i) We analyze the impact of aspect ratios on performance using the VeRi-776 and VehicleID datasets, providing guidance for input settings based on the distribution of original image aspect ratios. (ii) We introduce a patch-wise mixup strategy during ViT patchification (guided by spatial attention scores) and implement an uneven stride for better alignment with object aspect ratios. (iii) We propose a dynamic feature fusion ReID network to enhance model robustness. Our method outperforms state-of-the-art transformer-based approaches on both datasets, with only a minimal increase in inference time per image.
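
To make the uneven-stride idea in contribution (ii) concrete, the sketch below counts the patches a sliding-window patchifier would produce when the vertical and horizontal strides differ. It uses the standard no-padding sliding-window formula, floor((size - patch) / stride) + 1, and is not taken from the paper. The function names, the 16-pixel patch size, and the example input sizes and strides are illustrative assumptions.

```python
def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols) of patches for a no-padding sliding window.

    All defaults are hypothetical; the abstract does not state the
    patch size or the strides the authors use.
    """
    if height < patch or width < patch:
        raise ValueError("image is smaller than a single patch")
    if stride_h <= 0 or stride_w <= 0:
        raise ValueError("strides must be positive")
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


def token_count(height, width, patch=16, stride_h=16, stride_w=16):
    """Number of patch tokens the transformer would receive."""
    rows, cols = patch_grid(height, width, patch, stride_h, stride_w)
    return rows * cols


if __name__ == "__main__":
    # Square input, even stride: the usual non-overlapping ViT grid.
    print(patch_grid(224, 224))
    # Non-square input (assumed 192x256) with a smaller vertical stride,
    # so the shorter axis is sampled more densely than the longer one.
    print(patch_grid(192, 256, stride_h=12, stride_w=16))
```

With these assumed settings, a smaller stride along the shorter image axis brings the patch grid closer to square even though the input is not, which is one way to read "better alignment with object aspect ratios".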

Paper Structure

This paper contains 8 sections, 10 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: (Left) Existing Method: Image size is typically fixed and set to a single square shape. (Right) Our Method: Combined Vision Transformer (ViT)-based ReID model that dynamically fuses features extracted from multiple models. Each model is trained on a different fixed size and aspect ratio.
  • Figure 2: Patch mixup (PM) module.
  • Figure 3: Examples from the VehicleID (first two columns) and VeRi-776 (last two columns) test datasets show the impact of intra-image patch mixup (PM). This method blends image parts based on attention-driven distances, enhancing complexity to boost model robustness and reduce overfitting. The top row shows images without PM, while the bottom row includes images processed with PM.
  • Figure 4: VeRi-776 attention maps without (top) and with PM module (bottom).
  • Figure 5: VehicleID attention maps without (top) and with PM module (bottom).
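
The fusion described in the Figure 1 caption, where features from several models trained on different aspect ratios are combined, can be illustrated with a minimal sketch. The page does not say how the fusion weights are computed, so the softmax weighting below is an assumption chosen for illustration, and `fuse_features`, `scores`, and the toy vectors are hypothetical names and values rather than the authors' design.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, scores):
    """Fuse one feature vector per model into a single unit-length vector.

    ASSUMPTION: each model contributes in proportion to softmax(scores),
    where `scores` are hypothetical per-model relevance values. The
    paper's actual dynamic fusion rule is not given on this page.
    """
    if not features or len(features) != len(scores):
        raise ValueError("need one score per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must share one dimension")
    weights = softmax(scores)
    normed = [l2_normalize(f) for f in features]
    fused = [sum(w * f[i] for w, f in zip(weights, normed)) for i in range(dim)]
    return l2_normalize(fused)


if __name__ == "__main__":
    # Two toy 2-D features standing in for two aspect-ratio-specific models.
    print(fuse_features([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0]))
```

Normalizing each feature before weighting keeps a model with larger raw feature magnitudes from dominating the fused vector; whether the authors do the same is not stated here.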