Cerberus: Attribute-based person re-identification using semantic IDs

Chanho Eom, Geon Lee, Kyunghwan Cho, Hyeonseok Jung, Moonsub Jin, Bumsub Ham

TL;DR

Cerberus tackles attribute-based person reID by learning multiple partial representations aligned to semantic IDs (SIDs) derived from grouped attributes. It introduces a semantic guidance loss $\mathcal{L}_{sem}$ to pull same-SID representations toward SID prototypes and a regularization term $\mathcal{L}_{reg}$ to infer prototypes for unseen SIDs, enabling robust zero-shot generalization. The model achieves state-of-the-art results on Market-1501 and DukeMTMC-reID for attribute-based reID and delivers competitive PAR and APS performance using a single unified framework. This approach yields a practical, interpretable visual-semantic embedding that supports reID, PAR, and APS without task-specific fine-tuning, offering scalable deployment for attribute-driven surveillance tasks.

Abstract

We introduce a new framework, dubbed Cerberus, for attribute-based person re-identification (reID). Our approach leverages person attribute labels to learn local and global person representations that encode specific traits, such as gender and clothing style. To achieve this, we define semantic IDs (SIDs) by combining attribute labels, and use a semantic guidance loss to align the person representations with the prototypical features of the corresponding SIDs, encouraging the representations to encode the relevant semantics. Simultaneously, we enforce the representations of the same person to be embedded closely, so that the model can recognize subtle differences in appearance and discriminate between persons sharing the same attribute labels. To increase the generalization ability on unseen data, we also propose a regularization method that takes advantage of the relationships between SID prototypes. Our framework performs individual comparisons of local and global person representations between query and gallery images for attribute-based reID. By exploiting the SID prototypes aligned with the corresponding representations, it can also perform person attribute recognition (PAR) and attribute-based person search (APS) without bells and whistles. Experimental results on standard benchmarks for attribute-based person reID, Market-1501 and DukeMTMC, demonstrate the superiority of our model compared to the state of the art.
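As a rough illustration of the two ideas above (building SIDs by combining attribute labels, and pulling a representation toward the prototype of its SID), here is a minimal sketch in plain Python. The attribute group `HEAD_ATTRIBUTES`, the helper names, the `scale` parameter, and the choice of a softmax cross-entropy over cosine similarities are all assumptions made for illustration; the abstract does not specify the exact attribute grouping or the form of the loss.

```python
import math
from itertools import product

# Hypothetical attribute group for one body part (not taken from the paper).
HEAD_ATTRIBUTES = {
    "hair": ["short", "long"],
    "hat": ["no", "yes"],
}


def build_sids(attribute_group):
    """Assign one integer SID to every combination of attribute values."""
    names = sorted(attribute_group)
    combos = product(*(attribute_group[name] for name in names))
    return {combo: sid for sid, combo in enumerate(combos)}, names


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_guidance_loss(feature, prototypes, target_sid, scale=10.0):
    """Assumed form: cross-entropy over scaled cosine similarities between
    a representation and every SID prototype. The loss is small when the
    representation is closest to the prototype of its own SID."""
    logits = [scale * cosine(feature, p) for p in prototypes]
    max_logit = max(logits)
    log_partition = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_partition - logits[target_sid]


# Toy usage: four SIDs for the head, with hand-picked 2-D prototypes.
sid_of, names = build_sids(HEAD_ATTRIBUTES)
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
feature = [0.9, 0.1]

loss_near = semantic_guidance_loss(feature, prototypes, sid_of[("short", "no")])
loss_far = semantic_guidance_loss(feature, prototypes, sid_of[("long", "no")])
```

In this toy setup the feature lies close to the prototype of the SID `("short", "no")`, so `loss_near` is smaller than `loss_far`. In a trained model the prototypes would be learned jointly with the representations rather than fixed by hand.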

Paper Structure

This paper contains 35 sections, 9 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: (a) A visualization of a network architecture for existing attribute-based reID methods [liu2018ca3net, han2018attribute, tay2019aanet]. It exploits a ResNet-50 [he2016deep] cropped at $\texttt{conv4-1}$ as a backbone network, and has two branches on top of that to extract features for classifying person ID and attribute labels, i.e., reID and PAR features, respectively. (b) Quantitative comparisons of features for vanilla reID and attribute-based reID on Market-1501 [zheng2015scalable]. Concatenating the features from both branches into a single person representation degrades the reID performance compared to using the reID feature alone, due to the conflicting goals of reID and PAR. (c) Examples of different persons sharing the same person attributes, e.g., clothing color or gender. (Best viewed in color.)
  • Figure 2: An overview of Cerberus. We extract global and local feature maps, denoted by $\mathbf{F}_x^{g}$ and $\mathbf{F}_x^{l}$, respectively, from a given image. We then apply global average pooling (GAP) to the global feature map $\mathbf{F}_x^{g}$, and use fully connected (FC) and batch normalization (BN) layers to obtain representations for identity ($\mathbf{f}_x^\mathrm{I}$) and carrying ($\mathbf{f}_x^\mathrm{C}$), where the size of each representation is $d$. Similarly, we apply a part average pooling (PAP) layer, followed by a series of FC and BN layers, to the local feature map $\mathbf{F}_x^{l}$ to extract representations for the head, upper body, and lower body, denoted by $\mathbf{f}_x^\mathrm{H}$, $\mathbf{f}_x^\mathrm{U}$, and $\mathbf{f}_x^\mathrm{L}$, respectively, from the top, middle, and bottom parts of the image. Note that, for the local feature map $\mathbf{F}_x^{l}$, we insert an alignment module that estimates the region where a person is likely to exist. We define SIDs, and learn corresponding prototypical features ($\mathbf{p}^\mathrm{I}_\mathrm{i}$, $\mathbf{p}^\mathrm{C}_\mathrm{c}$, $\mathbf{p}^\mathrm{H}_\mathrm{h}$, $\mathbf{p}^\mathrm{U}_\mathrm{u}$, and $\mathbf{p}^\mathrm{L}_\mathrm{l}$), which are used to guide embeddings of person representations. See the text for more details. (Best viewed in color.)
  • Figure 3: Illustrations of constructing the set of semantic IDs. See the text for more details. (Best viewed in color.)
  • Figure 4: Illustrations of the embedding spaces in our model. (a) The semantic guidance term encourages the representations of persons belonging to the same SID to be grouped close to the corresponding SID prototype. (b) The identification term enables the representations of the same person to form clusters. Accordingly, the two terms allow us to differentiate subtle differences between SIDs and ID labels. (c) We constrain the SID prototypes by their semantic relations, which makes it possible to estimate prototypes of unseen SIDs. For example, if there is no person belonging to 'old female' in the training data, its SID prototype may be positioned incorrectly in the embedding space. Using the regularization loss, we encourage the SID prototype for 'old female' to be placed near 'adult female', reflecting the relationship between the prototypes for 'adult male' and 'old male' (represented by the red dotted line). The red solid lines indicate the residual vectors as defined in Eq. \ref{eq:residual_vector}. The points with the same color indicate that they correspond to the same identity. See the text for more details. (Best viewed in color.)
  • Figure 5: Illustrations of inference processes for reID, APS, and PAR. (a) reID: We compare person representations of query and gallery images by computing cosine similarity between individual partial representations. (b) APS: We replace query representations with the SID prototypes that the query belongs to, and calculate cosine similarity with person representations of gallery images. (c) PAR: We find the SID prototypes that show the highest matching score with each partial representation of the query, and convert their SIDs into attributes. (Best viewed in color.)
  • ...and 5 more figures
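
The three inference modes described in the Figure 5 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout keyed by part (`"U"` for upper body, `"L"` for lower body), and the choice to sum per-part cosine similarities into one score are all hypothetical, since the caption does not say how the individual comparisons are aggregated.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reid_score(query, gallery):
    """reID: compare partial representations part by part.
    Summing the per-part similarities is an assumption."""
    return sum(cosine(query[part], gallery[part]) for part in query)


def aps_score(query_sids, prototypes, gallery):
    """APS: the query is a set of SIDs, so the SID prototypes stand in
    for the query representations."""
    return sum(
        cosine(prototypes[part][sid], gallery[part])
        for part, sid in query_sids.items()
    )


def par_predict(image, prototypes):
    """PAR: for each part, return the SID whose prototype best matches
    the partial representation of the image."""
    return {
        part: max(
            range(len(protos)),
            key=lambda sid: cosine(protos[sid], image[part]),
        )
        for part, protos in prototypes.items()
    }


# Toy 2-D prototypes: two SIDs per part.
prototypes = {
    "U": [[1.0, 0.0], [0.0, 1.0]],
    "L": [[1.0, 0.0], [0.0, 1.0]],
}
gallery_a = {"U": [0.9, 0.1], "L": [0.2, 0.8]}
gallery_b = {"U": [0.1, 0.9], "L": [0.8, 0.2]}

query_sids = {"U": 0, "L": 1}  # attribute query expressed as SIDs
score_a = aps_score(query_sids, prototypes, gallery_a)
score_b = aps_score(query_sids, prototypes, gallery_b)
predicted = par_predict(gallery_a, prototypes)
self_match = reid_score(gallery_a, gallery_a)
```

Here `gallery_a` ranks above `gallery_b` for the attribute query because its partial representations lie closer to the queried prototypes. Converting the SIDs returned by `par_predict` back into attribute labels would use the inverse of whatever mapping defined the SIDs, which is not shown.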