Transformer for Object Re-Identification: A Survey

Mang Ye, Shuoyi Chen, Chenyue Li, Wei-Shi Zheng, David Crandall, Bo Du

TL;DR

A comprehensive review and in-depth analysis of Transformer-based Re-ID is provided, categorizing existing works into Image/Video-Based Re-ID, Re-ID with limited data/annotations, Cross-Modal Re-ID, and Special Re-ID Scenarios, and thoroughly elucidating the advantages demonstrated by the Transformer.

Abstract

Object Re-identification (Re-ID) aims to identify specific objects across different times and scenes, and is a widely researched task in computer vision. For a prolonged period, this field has been predominantly driven by deep learning technology based on convolutional neural networks. In recent years, the emergence of Vision Transformers has spurred a growing number of studies delving deeper into Transformer-based Re-ID, which have continuously broken performance records and brought significant progress to the Re-ID field. Offering a powerful, flexible, and unified solution, Transformers cater to a wide array of Re-ID tasks with unparalleled efficacy. This paper provides a comprehensive review and in-depth analysis of Transformer-based Re-ID. By categorizing existing works into Image/Video-Based Re-ID, Re-ID with limited data/annotations, Cross-Modal Re-ID, and Special Re-ID Scenarios, we thoroughly elucidate the advantages demonstrated by the Transformer in addressing a multitude of challenges across these domains. Considering the trending unsupervised Re-ID, we propose a new Transformer baseline, UntransReID, achieving state-of-the-art performance on both single- and cross-modal tasks. For the under-explored animal Re-ID, we devise a standardized experimental benchmark and conduct extensive experiments to explore the applicability of Transformers to this task and facilitate future research. Finally, we discuss some important yet under-investigated open issues in the large foundation model era. We believe this survey will serve as a new handbook for researchers in this field. A periodically updated website will be available at https://github.com/mangye16/ReID-Survey.

Paper Structure

This paper contains 29 sections, 2 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: (1) We show the performance of recent state-of-the-art methods on the widely-used person Re-ID dataset MSMT17 (left). The transformer-based methods have achieved a comprehensive lead in accuracy since 2021, while the CNN-based method for single-modal image Re-ID has not been studied. (2) We show state-of-the-art results of representative works in different Re-ID tasks: unsupervised (USL) Re-ID on the MSMT17 dataset [wei2018person], Text-Image on CUHK-PEDES [li2017person], end-to-end person search on PRW [zheng2017person], Re-ID in UAVs on PRAI-1581 [zhang2020person], and cloth-changing Re-ID on LTCC [qian2020long].
  • Figure 2: An overview of the framework structure for the survey, illustrating key sections and their interrelationships.
  • Figure 3: General object Re-ID process. Given a query that can be any type of image, text, video, etc., the goal of Re-ID is to search for the specific object from gallery data collected by different cameras.
  • Figure 4: The first pure transformer baseline for object Re-ID [he2021transreid]. The Vision Transformer backbone [dosovitskiyimage] is adopted as a feature extractor, optimized with the ID loss and triplet loss [arxiv17triplet] widely used in Re-ID.
  • Figure 5: Different Transformer architectures designed for image-based Re-ID. (a) The basic Re-ID baseline based on Vision Transformer [he2021transreid]. (b) Pyramid Transformer for learning multi-scale features [li2022pyramidal]. (c) Transformer and CNN hybrid architecture for aggregating hierarchical features [zhang2021hat]. (d) Combination of graph structure and Transformer [shen2023git].
  • ...and 3 more figures
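
The Figure 4 caption names the training objective of the baseline: an ID (identity classification) loss combined with a triplet loss. The following is a minimal sketch of that combination in plain Python, not the authors' implementation. The function names, the margin of 0.3, the Euclidean distance, and the unweighted sum of the two terms are all assumptions made for illustration.

```python
import math


def id_loss(logits, label):
    """Cross-entropy of one sample over identity logits (softmax + negative log-likelihood)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is a hypothetical default, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def reid_loss(logits, label, anchor, positive, negative, margin=0.3):
    """Unweighted sum of ID loss and triplet loss (the weighting is an assumption)."""
    return id_loss(logits, label) + triplet_loss(anchor, positive, negative, margin)
```

In this sketch `anchor` and `positive` stand for features of the same identity and `negative` for a different identity. The triplet term drops to zero once the negative is farther from the anchor than the positive by at least the margin.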