CLEAR: Cross-Transformers with Pre-trained Language Model is All you need for Person Attribute Recognition and Retrieval

Doanh C. Bui; Thinh V. Le; Ba Hung Ngo; Tae Jong Choi

TL;DR

CLEAR introduces a unified two-branch cross-transformers framework for person attribute recognition (PAR) and attribute-based retrieval (AR). By fusing global and local long-range dependencies via channel-aware self-attention and cross-fusion, it delivers strong PAR performance, while a GPT-derived pseudo-description and lightweight adapters enable robust cross-modal retrieval through margin learning. The approach achieves state-of-the-art or competitive results across five benchmarks (PA100K, PETA, RAPv2, Market-1501, UPAR2024) and shows substantial gains in retrieval metrics, especially on Market-1501 and UPAR2024. This unified design reduces the need for task-specific modules and demonstrates practical impact for person-centric security and retrieval applications.

Abstract

Person attribute recognition and attribute-based retrieval are two core human-centric tasks. In the recognition task, the challenge is specifying attributes depending on a person's appearance, while the retrieval task involves searching for matching persons based on attribute queries. There is a significant relationship between recognition and retrieval tasks. In this study, we demonstrate that if there is a sufficiently robust network to solve person attribute recognition, it can be adapted to facilitate better performance for the retrieval task. Another issue that needs addressing in the retrieval task is the modality gap between attribute queries and persons' images. Therefore, in this paper, we present CLEAR, a unified network designed to address both tasks. We introduce a robust cross-transformers network to handle person attribute recognition. Additionally, leveraging a pre-trained language model, we construct pseudo-descriptions for attribute queries and introduce an effective training strategy to train only a few additional parameters for adapters, facilitating the handling of the retrieval task. Finally, the unified CLEAR model is evaluated on five benchmarks: PETA, PA100K, Market-1501, RAPv2, and UPAR-2024. Without bells and whistles, CLEAR achieves state-of-the-art performance or competitive results for both tasks, significantly outperforming other competitors in terms of person retrieval performance on the widely-used Market-1501 dataset.
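The retrieval idea in the abstract (turn an attribute query into a pseudo-description, then train so that matching images score higher than non-matching ones by a margin) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the fixed sentence template stands in for the pre-trained language model, and the hinge loss over cosine similarity is a generic triplet-style margin loss, not necessarily the formulation used in CLEAR. The names `build_pseudo_description`, `cosine`, and `margin_loss` are hypothetical.

```python
import math


def build_pseudo_description(attributes):
    """Turn {attribute_name: bool} into a short sentence.

    Hypothetical template; the paper builds pseudo-descriptions with a
    pre-trained language model rather than a fixed pattern like this.
    """
    present = [name for name, is_on in attributes.items() if is_on]
    if not present:
        return "A person."
    return "A person with " + ", ".join(present) + "."


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(query, positive, negative, margin=0.2):
    """Generic hinge loss: max(0, margin - sim(q, pos) + sim(q, neg)).

    The loss is zero once the matching image beats the non-matching
    image by at least `margin` in cosine similarity. The margin value
    0.2 is an arbitrary choice for this sketch.
    """
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))


if __name__ == "__main__":
    query_text = build_pseudo_description(
        {"backpack": True, "hat": False, "long hair": True}
    )
    print(query_text)
    # Toy 2-D embeddings: the query aligns with the positive image.
    print(margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

In a real system the query and image vectors would come from the text and visual encoders; the toy 2-D vectors here only show when the loss vanishes and when it does not.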

Paper Structure (13 sections, 10 equations, 5 figures, 7 tables)

This paper contains 13 sections, 10 equations, 5 figures, 7 tables.

Introduction
Related Work
Methodology
Overview
Cross-Transformers for attribute recognition
Language-based margin learning for retrieval
Experimental Results
Dataset and evaluation protocol
Implementation Details
Comparison with the State-of-the-Art
Ablation study
Qualitative Results
Conclusion

Figures (5)

Figure 1: Unified CLEAR network for both person attribute recognition and retrieval tasks. $f^{cls}_{par}$ denotes the head classifier for attribute recognition. $f^{vis}_{ret}$ denotes the auxiliary visual encoder. $f^{text}_{ret}$ denotes the auxiliary text encoder for the soft pseudo-description constructed from query attributes. $f^{attr}_{ret}$ denotes the auxiliary encoder for binary query attributes. Separate symbols in the figure represent the concatenation operation and the scoring used to match query attributes with persons during the search process.
Figure 2: Cross-Transformers backbone ($f_{par}$) for person attribute recognition.
Figure 3: Margin learning with pseudo description (soft embedding query) and hard binary attribute (hard embedding query).
Figure 4: t-SNE visualization for ten randomly chosen queries, with each query accompanied by its corresponding set of 20 person images. $\star$ denotes query representations. $\circ$ denotes person image representations. (a) ASMR. (b) Hard Binary Attribute (HA). (c) Attribute Word (W). (d) Soft Pseudo Caption (SP). (e) Soft + Hard Query (Ours).
Figure 5: Top five retrieval results of ASMR and CLEAR (ours).
