Contrastive Multiple Instance Learning for Weakly Supervised Person ReID

Jacob Tyo; Zachary C. Lipton

TL;DR

This work addresses weakly supervised person re-identification (ReID) by introducing Contrastive Multiple Instance Learning (CMIL), a framework that learns from bag-level labels without pseudo-labels. CMIL uses a ResNet-50 based per-image encoder and a permutation-invariant set-transformer aggregator to produce bag representations. It is optimized with a weighted combination of a triplet loss, an identity (cross-entropy) loss, and an optional alignment loss: $\mathcal{L} = \alpha \mathcal{L}_{\text{triplet}} + \beta \mathcal{L}_{\text{CE}} + \gamma \mathcal{L}_{\text{align}}$. The authors release WL-MUDD, a real-world weakly labeled ReID dataset, and evaluate CMIL on WL-Market1501, WL-MUDD, and SYSU-30k, where it achieves state-of-the-art or near-state-of-the-art performance under weak supervision. Two key findings are that the alignment loss is empirically ineffective and that simple aggregation such as average pooling performs strongly and robustly, which suggests the approach is practical for weakly labeled ReID tasks. The work contributes both a new dataset and a scalable, label-efficient method that narrows the gap to fully supervised ReID in real-world settings.
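To make the combined objective concrete, here is a minimal pure-Python sketch of how a bag embedding and the weighted loss could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the margin of 0.3, the Euclidean distance, and the default weights are all hypothetical, and mean pooling stands in for the aggregator. The alignment loss is not defined in this summary, so it is passed in as a plain number.

```python
import math

def mean_pool(embeddings):
    """Average per-image embeddings into one bag embedding (order-independent)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on the gap between anchor-positive and anchor-negative distances.
    The margin value is an assumption made for this example."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def cross_entropy(logits, target):
    """Softmax cross-entropy for one bag, using log-sum-exp for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def total_loss(l_triplet, l_ce, l_align=0.0, alpha=1.0, beta=1.0, gamma=0.0):
    """Weighted sum L = alpha * L_triplet + beta * L_CE + gamma * L_align.
    The default weights are placeholders, not values from the paper."""
    return alpha * l_triplet + beta * l_ce + gamma * l_align

# Toy 2-D embeddings for three bags (made-up numbers).
anchor_bag = mean_pool([[1.0, 0.0], [3.0, 0.0]])    # bag labeled identity A
positive_bag = mean_pool([[2.0, 1.0]])              # another bag labeled A
negative_bag = mean_pool([[2.0, 0.0], [2.0, 1.0]])  # bag labeled identity B

l_tri = triplet_loss(anchor_bag, positive_bag, negative_bag)
l_ce = cross_entropy([0.0, 0.0], target=0)          # uniform logits over 2 identities
loss = total_loss(l_tri, l_ce)
```

With `gamma=0.0` the alignment term drops out entirely, which mirrors the reported finding that the alignment loss did not help in practice.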

Abstract

The acquisition of large-scale, precisely labeled datasets for person re-identification (ReID) poses a significant challenge. Weakly supervised ReID has begun to address this issue, although its performance lags behind fully supervised methods. In response, we introduce Contrastive Multiple Instance Learning (CMIL), a novel framework tailored for more effective weakly supervised ReID. CMIL distinguishes itself by requiring only a single model and no pseudo-labels while leveraging contrastive losses, a technique that has significantly enhanced traditional ReID performance yet is absent from all prior MIL-based approaches. Through extensive experiments and analysis across three datasets, CMIL not only matches state-of-the-art performance on the large-scale SYSU-30k dataset with fewer assumptions but also consistently outperforms all baselines on the WL-Market1501 and Weakly Labeled MUddy racer re-iDentification (WL-MUDD) datasets. We introduce and release the WL-MUDD dataset, an extension of the MUDD dataset featuring naturally occurring weak labels from the real-world application at PerformancePhoto.co. All our code and data are accessible at https://drive.google.com/file/d/1rjMbWB6m-apHF3Wg_cfqc8QqKgQ21AsT/view?usp=drive_link.

Paper Structure (13 sections, 8 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Datasets and Problem Setup
Weakly Supervised Re-Identification
WL-MUDD Dataset
Contrastive Multiple Instance Learning
Loss Function
Experiments
Implementation Details and Hyperparameter Tuning
Baseline Methods
Results and Discussion
Ablation Study
Related Work
Conclusion

Figures (4)

Figure 1: The annotation process for strong and weak ReID. Strong annotation groups individual crops into bags by identity, whereas weak annotation groups whole images by a shared identity, and all crops from the grouped images then form a bag.
Figure 2: Four example subsets from four different bags of the WL-MUDD dataset. Each image within a bag is outlined in green if it has the same identity as the bag, and in red if it does not. Bags can differ widely in their ratio of correct to incorrect identities among the underlying images.
Figure 3: The CMIL framework. For each image in a batch of bags, a feature extraction network produces an embedding. Then, for each bag, the corresponding image embeddings are combined into a single bag embedding via an accumulation function. Finally, the bag embeddings are used to calculate the cross-entropy loss (or identity loss), as well as the triplet loss over all valid triplets from the batch.
Figure 4: The rank-1 accuracy and the alignment loss throughout a training run. The alignment loss behaves unintuitively: the best alignment (i.e., the lowest loss) does not correspond to the best model accuracy (i.e., the highest). This behavior is characteristic of every model trained in this work, including those using different accumulation functions.
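The Figure 3 caption mentions computing the triplet loss over all valid triplets in a batch. The sketch below shows one common way to enumerate such triplets from bag-level labels. The definition of "valid" used here (anchor and positive are distinct bags sharing a label, and the negative bag has a different label) is an assumption based on standard batch-all triplet mining, and `valid_triplets` is a hypothetical helper name, not taken from the paper.

```python
from itertools import permutations

def valid_triplets(bag_labels):
    """Return (anchor, positive, negative) index triples for a batch of bags.

    Assumed validity rule: anchor and positive are different bags with the
    same label, and the negative bag carries a different label.
    """
    triplets = []
    for a, p in permutations(range(len(bag_labels)), 2):
        if bag_labels[a] != bag_labels[p]:
            continue  # anchor and positive must share a bag label
        for n, label in enumerate(bag_labels):
            if label != bag_labels[a]:
                triplets.append((a, p, n))
    return triplets

# Three bags: two labeled "id7", one labeled "id3" (made-up labels).
batch_labels = ["id7", "id7", "id3"]
triplets = valid_triplets(batch_labels)
```

A batch in which every bag has a distinct label yields no triplets, so batches would need to be sampled with at least two bags per identity for the triplet term to contribute.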
