Data Pruning Can Do More: A Comprehensive Data Pruning Approach for Object Re-identification

Zi Yang, Haojin Yang, Soumajit Majumder, Jorge Cardoso, Guillermo Gallego

TL;DR

This work tackles data quality issues in object re-identification (ReID) by introducing a comprehensive data pruning framework that exploits the full logit trajectory during training. By computing a soft label for each sample as $\tilde{\boldsymbol{y}} = \sigma\left(\frac{1}{T}\sum_{t=1}^{T} \mathbf{z}^{(t)}(\boldsymbol{x})\right)$ and using its entropy $H(\tilde{\boldsymbol{y}})$ as the importance score, the method robustly identifies informative samples while enabling label correction and outlier removal. The approach is plug-and-play and architecture-agnostic, achieving significant training-time reductions (43–60% less data and up to 10x cheaper importance-score estimation) and pruning rates of 35% (VeRi), 30% (MSMT17), and 5% (Market1501) with negligible accuracy loss ($<0.1\%$), and it generalizes to classification datasets as well. Overall, the framework offers practical gains in data efficiency and robustness to noisy labels, with strong potential for extension to unsupervised pruning in the future.
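The scoring rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes that $\sigma$ denotes the softmax function, that the entropy uses the natural logarithm, and that the per-epoch logits of a sample have already been recorded. The function names and the two toy logit histories are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label(logit_history):
    """Average the logits of one sample over T epochs, then apply softmax.

    logit_history: list of T logit vectors (one per epoch), each of length C.
    Assumption: sigma in the formula is the softmax function.
    """
    num_epochs = len(logit_history)
    num_classes = len(logit_history[0])
    mean_logits = [
        sum(epoch[c] for epoch in logit_history) / num_epochs
        for c in range(num_classes)
    ]
    return softmax(mean_logits)

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy histories (hypothetical values): 2 epochs, 3 classes.
easy = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0]]   # one class dominates throughout
hard = [[1.0, 0.8, 0.0], [1.2, 1.0, 0.0]]   # two classes stay close

easy_score = entropy(soft_label(easy))
hard_score = entropy(soft_label(hard))
```

Under these assumptions the confidently classified sample receives a lower entropy than the ambiguous one, so `hard_score > easy_score`, which matches the idea that harder samples are treated as more important.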

Abstract

Previous studies have demonstrated that not every sample in a dataset is equally important during training. Data pruning aims to remove less important or less informative samples while still achieving results comparable to training on the original (untruncated) dataset, thereby reducing storage and training costs. However, the majority of data pruning methods are applied to image classification tasks. To our knowledge, this work is the first to explore the feasibility of applying these pruning methods to object re-identification (ReID) tasks, while also presenting a more comprehensive data pruning approach. By fully leveraging the logit history during training, our approach offers a more accurate and comprehensive metric for quantifying sample importance, and it also corrects mislabeled samples and recognizes outliers. Furthermore, our approach is highly efficient, reducing the cost of importance score estimation by 10 times compared to existing methods. Our approach is a plug-and-play, architecture-agnostic framework that can eliminate 35%, 30%, and 5% of the samples, and the corresponding training time, on the VeRi, MSMT17 and Market1501 datasets, respectively, with negligible loss in accuracy (< 0.1%). The lists of important, mislabeled, and outlier samples from these ReID datasets are available at https://github.com/Zi-Y/data-pruning-reid.
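As a rough illustration of how importance scores and soft labels might be used once they are available, the sketch below drops the lowest-scoring fraction of samples and flags samples whose soft label disagrees with the annotated identity. Both helper functions and all values are hypothetical, and the argmax-disagreement rule is an assumption made for illustration; this summary does not state the exact criteria the paper uses for mislabeled samples or outliers.

```python
def prune_lowest_scores(scores, prune_fraction):
    """Return the sorted indices kept after dropping the lowest-scoring fraction.

    scores: one importance score per sample (higher = more important).
    prune_fraction: share of samples to remove, e.g. 0.35.
    """
    num_pruned = int(len(scores) * prune_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[num_pruned:])

def flag_label_disagreements(soft_labels, labels):
    """List (index, suggested_label) pairs where argmax(soft label) != label.

    Assumption: a disagreement marks a *candidate* mislabeled sample only.
    """
    flagged = []
    for i, (probs, label) in enumerate(zip(soft_labels, labels)):
        predicted = max(range(len(probs)), key=lambda c: probs[c])
        if predicted != label:
            flagged.append((i, predicted))
    return flagged

# Toy inputs (hypothetical): four samples with importance scores.
scores = [0.05, 1.0, 0.3, 0.02]
kept = prune_lowest_scores(scores, prune_fraction=0.5)

# Two samples with soft labels over three identities and annotated labels.
soft_labels = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
labels = [0, 2]
candidates = flag_label_disagreements(soft_labels, labels)
```

Here `kept` contains the two highest-scoring samples, and `candidates` contains the second sample together with the identity its soft label favors.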

Paper Structure

This paper contains 37 sections, 2 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: (a) The workflow of data pruning. (b) Our data pruning approach not only identifies less important samples, but also rectifies mislabeled samples and removes outliers (boxes highlighted in turquoise). (c) demonstrates an example of our method, where all images are from the same person (id 630, MSMT17 dataset). Our approach serves as a "pre-processing" step, reducing the dataset size to save storage and training costs of ReID models while having minimal impact on their accuracy.
  • Figure 2: Logit trajectories for three samples (i.e., the evolution of the log probabilities of each sample belonging to each class over the course of training): (a) easy sample, (b) hard sample and (c) harder sample. There are three classes in total and each logit trajectory corresponds to one of these classes. The difference between the logits for each class reflects the level of difficulty of the sample. In general, the more difficult the sample, the more important it is. The forgetting score cannot distinguish (a) from (b), while the EL2N score relies solely on the model’s prediction at the last epoch (thus ignoring the history, or "training dynamics"), hence it cannot differentiate between (b) and (c). Our approach fully exploits the training dynamics of a sample by using the average logit values over all epochs to generate a more robust soft label. The entropy of this soft label is then used to summarize the importance of the sample.
  • Figure 3: Illustration of the generated soft labels averaged over 12 training epochs for different sample types (a)--(d) and their entropies. Multi-target coexistence (d) is one such type of outlier. We show the top 3 identities. Notably, our soft label can accurately indicate the ground-truth label of the mislabeled sample (c) without being influenced by the erroneous label. Images are from Market1501 dataset.
  • Figure 4: Data pruning on ReID datasets. We report the mean of Rank1 and mAP on 3 ReID datasets (labeled at the top), obtained by training on the pruned datasets. For each method, we carry out four independent runs with different random seeds and report the mean.
  • Figure 5: Generalization performance. We train a ResNet101 and a ViT model using the sample ordering of ResNet50 on the MSMT17 dataset. For each method, we carry out four independent runs with different random seeds and report the mean values. Shaded areas indicate +/- one standard deviation over the four runs.
  • ...and 14 more figures