Study on Aspect Ratio Variability toward Robustness of Vision Transformer-based Vehicle Re-identification
Mei Qiu, Lauren Christopher, Lingxi Li
TL;DR
This work addresses the sensitivity of Vision Transformer–based vehicle Re-ID to input aspect ratios. It introduces a multi-aspect-ratio ViT framework with uneven patch strides, intra-image Patch Mixing guided by spatial attention, and dynamic feature fusion at inference that combines information from models trained on different aspect ratios. The approach yields substantial gains, reaching an mAP of $91.0\%$ on VehicleID and outperforming several state-of-the-art methods, while also improving robustness to aspect-ratio variations on VeRi-776. Its practical impact is more reliable Re-ID across diverse imaging conditions without changes to the underlying ViT architecture, at the cost of increased inference time that could be mitigated by pruning or selective patching.
Abstract
Vision Transformers (ViTs) have excelled in vehicle re-identification (ReID) tasks. However, non-square aspect ratios of image or video inputs may significantly affect re-identification performance. To address this issue, we propose a novel ViT-based ReID framework that fuses models trained on a variety of aspect ratios. Our main contributions are threefold: (i) We analyze how aspect ratio affects performance on the VeRi-776 and VehicleID datasets, and use the aspect ratios of the original images to guide input settings. (ii) We introduce intra-image patch-wise mixup during ViT patchification (guided by spatial attention scores) and implement uneven strides to better match object aspect ratios. (iii) We propose a dynamic feature fusion ReID network that enhances model robustness. Our ReID method achieves a mean Average Precision (mAP) of 91.0\%, a significant improvement over the closest state-of-the-art (CAL) result of 80.9\% on the VehicleID dataset.
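
To make the uneven-stride idea in contribution (ii) concrete, the following minimal sketch counts the patches a sliding-window patch embedding would produce when the vertical and horizontal strides differ. It assumes no padding and a square patch; the function names, the input sizes, and the stride values are illustrative assumptions, not settings taken from the paper.

```python
def patch_grid(height, width, patch=16, stride_h=16, stride_w=16):
    """Return (rows, cols) of patches for an unpadded sliding-window patchifier.

    Hypothetical helper: assumes a square patch of side `patch` and
    independent strides along the height and width axes.
    """
    if height < patch or width < patch:
        raise ValueError("image must be at least as large as one patch")
    if stride_h <= 0 or stride_w <= 0:
        raise ValueError("strides must be positive")
    rows = (height - patch) // stride_h + 1
    cols = (width - patch) // stride_w + 1
    return rows, cols


def token_count(height, width, patch=16, stride_h=16, stride_w=16):
    """Number of patch tokens fed to the transformer (class token excluded)."""
    rows, cols = patch_grid(height, width, patch, stride_h, stride_w)
    return rows * cols


# Square input, equal strides: the usual 14 x 14 grid of a ViT-B/16 at 224 x 224.
square = patch_grid(224, 224)

# Non-square input (illustrative 256 x 128) with a smaller vertical stride,
# so patches overlap vertically and the grid is denser along the height.
uneven = patch_grid(256, 128, patch=16, stride_h=12, stride_w=16)
```

With these illustrative values, the uneven setting samples the longer image axis more densely than the shorter one, which is one plausible way a stride choice can be matched to the object aspect ratio; the extra tokens also indicate why such settings raise compute cost.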
