UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, Changxin Gao

TL;DR

UFineBench addresses the coarse textual annotations of existing text-based person retrieval datasets by introducing UFine6926, a large-scale dataset with two ultra-fine descriptions per image, averaging 80.8 words each. It also presents UFine3C, a cross-domain evaluation set, and the mean Similarity Distribution (mSD) metric to capture continuous retrieval quality. The proposed CFAM framework uses a shared cross-modal granularity decoder and hard negative matching to achieve fine-grained cross-modal alignment, delivering strong performance on UFine6926 and robust generalization to real-world, cross-domain scenarios. On standard benchmarks and in cross-domain settings, CFAM achieves competitive or superior results, illustrating the value of high textual granularity and cross-modal granularity modeling. The dataset and code are publicly available, enabling further research on fine-grained cross-modal retrieval.

Abstract

Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders models from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new dataset named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of previous datasets. In addition to standard in-domain evaluation, we also propose a special evaluation paradigm more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient algorithm especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and a hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench.

Paper Structure (23 sections, 8 equations, 9 figures, 10 tables)

This paper contains 23 sections, 8 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Comparisons between our proposed UFine6926 and other existing datasets. (a)-(b) are examples from CUHK-PEDES. In (a), some fine-grained features not described in the text are highlighted in red boxes. In (b), the text does not provide enough details to closely match its intended identity but effectively describes the other identity. Meanwhile, two examples from UFine6926 are presented on the right, with ultra fine-grained texts. As the text details some fine-grained features in the images (highlighted in corresponding colors), it not only provides rich cross-modal information but also effectively distinguishes highly similar image samples.
  • Figure 2: A toy example of the difference between SD and AP metrics. Green and red boxes mean true and false matches, respectively. For these three rank lists, the AP remains 0.833. But SD = 0.536, 0.744 and 0.697, respectively.
  • Figure 3: Overview of the proposed CFAM framework.
  • Figure 4: Comparison of rank-10 retrieval results on UFine3C between CFAM trained on UFine6926 (the first row) and CUHK-PEDES (the second row) for each textual description. The images that fully match the text are marked in green, and the unmatched ones are marked in red.
  • Figure 5: Some examples of our proposed UFine6926. Every image has two different fine-grained textual descriptions that describe the person's appearance in detail. Some fine-grained features are highlighted in blue or orange boxes and texts accordingly.
  • ...and 4 more figures
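
The AP value quoted in the Figure 2 caption can be illustrated with a minimal sketch. The rank list below is hypothetical (true matches at ranks 1 and 3) and is not taken from the figure; it is simply one list that yields AP = 0.833. The sketch also assumes every true match appears in the ranked list. Because AP depends only on the positions of true and false matches, it cannot reflect the similarity scores behind the ranking, which is the limitation the caption contrasts with SD. SD itself is not implemented here, since its definition is not given in this section.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list.

    relevance[i] is True when the item at rank i + 1 is a true match.
    Assumes all true matches for the query appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(relevance, start=1):
        if is_match:
            hits += 1
            # Precision measured at the rank of each true match.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical rank list: true matches at ranks 1 and 3.
ranked = [True, False, True, False, False]
print(round(average_precision(ranked), 3))  # 0.833
```

Here the precision is 1/1 at rank 1 and 2/3 at rank 3, so AP = (1 + 2/3) / 2 ≈ 0.833, regardless of how confident the model was about each retrieved image.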