Contributing Dimension Structure of Deep Feature for Coreset Selection

Zhijing Wan, Zhixiang Wang, Yuran Wang, Zheng Wang, Hongyuan Zhu, Shin'ichi Satoh

TL;DR

This work proposes a feature-based diversity constraint, compelling the chosen subset to exhibit maximum diversity, and enhances the performance of five classical selection methods by integrating the CDS constraint.

Abstract

Coreset selection seeks to choose a subset of crucial training samples for efficient learning. It has gained traction in deep learning, particularly as training datasets have grown in size. Sample selection hinges on two main aspects: the representativeness of samples, which enhances performance, and the diversity of samples, which averts overfitting. Existing methods typically measure both representation and diversity with similarity metrics such as the L2-norm. They handle representation capably via distribution matching, guided by the similarity of features, gradients, or other information between samples. However, their selection of diverse samples remains sub-optimal. This is because similarity metrics usually just aggregate per-dimension similarities, without acknowledging disparities among the dimensions that contribute significantly to the final similarity. As a result, they fall short of adequately capturing diversity. To address this, we propose a feature-based diversity constraint that compels the chosen subset to exhibit maximum diversity. The key is a novel Contributing Dimension Structure (CDS) metric. Unlike similarity metrics that measure the overall similarity of high-dimensional features, the CDS metric considers both the reduction of redundancy in feature dimensions and the differences between the dimensions that contribute significantly to the final similarity. We show that existing methods tend to favor samples with similar CDS, which reduces the variety of CDS types within the coreset and in turn hinders model performance. In response, we enhance five classical selection methods by integrating the CDS constraint. Experiments on three datasets demonstrate that the proposed method is generally effective in boosting existing methods.
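
To make the abstract's central point concrete, the following minimal Python sketch shows three samples that an L2 distance cannot tell apart, even though different dimensions contribute to that distance in each case. The helper name `contributing_dims`, the threshold value `beta = 0.5`, and the toy 2-D samples are all hypothetical choices for illustration and are not taken from the paper.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: aggregates all per-dimension differences into one number."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contributing_dims(a, b, beta):
    """Hypothetical helper: flag each dimension whose absolute difference exceeds beta."""
    return tuple(int(abs(x - y) > beta) for x, y in zip(a, b))

reference = (0.0, 0.0)
samples = {
    "A": (1.0, 0.0),  # differs from the reference only in dimension 0
    "B": (0.0, 1.0),  # differs only in dimension 1
    "C": (0.6, 0.8),  # differs in both dimensions
}
beta = 0.5  # assumed threshold, chosen only for this toy example

distances = {name: l2_distance(s, reference) for name, s in samples.items()}
structures = {name: contributing_dims(s, reference, beta) for name, s in samples.items()}

for name in samples:
    print(name, round(distances[name], 6), structures[name])
```

All three samples lie at (approximately) the same L2 distance from the reference, yet each has a different pattern of contributing dimensions. That pattern is the kind of information an aggregate similarity score discards.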

Paper Structure (35 sections, 11 equations, 11 figures, 4 tables, 2 algorithms)

Figures (11)

  • Figure 1: Our method and motivation. (a) We combine the proposed CDS metric and constraint with the current coreset selection pipeline. (b) The CDS metric and constraint enhance the performance of the SOTA method, GC [iyer2021submodular]. Although replacing the CDS metric with the $L_2$ distance employed by previous feature-based methods still improves GC, integrating our proposed CDS metric is more effective, since it captures more diverse, informative samples. (c) Previous feature-based methods using the $L_2$ metric can treat three distinct samples as equivalent, while (d) our CDS metric effectively distinguishes these samples by pruning the feature space and dividing it into different partitions. Note that here we set the pruned dimension (C-dim) to 2 for demonstration.
  • Figure 2: Different metrics. $L_1$ and $L_2$ quantify the magnitude of the difference between samples, while the Cosine distance evaluates the difference in direction between samples. In contrast, our CDS metric evaluates the impact of different dimensions, making it an effective tool for assessing diversity. Note that the reference sample is positioned at the origin for $L_1$, $L_2$, and CDS, while it lies along the positive horizontal axis for Cosine distance. Regions with the same color indicate the same score.
  • Figure 3: CDS metric for deep features. Given the high-dimensional feature matrix, we first reduce its dimension from $K$ to $k$ using PCA. Then, we compute the central feature $[\mu_{0}, \mu_{1}, \ldots, \mu_{k-1}]$ of the dimension-reduced feature matrix. Next, we obtain the CDS of each sample by comparing, in each dimension, the difference between the sample's feature and the central feature against a threshold $\beta$, which divides the feature space into different partitions. Finally, the CDS relationship matrix $\bm{R}$ used by the subsequent CDS constraint is obtained by comparing the CDSs of every pair of samples to see whether they are the same.
  • Figure 4: Analyses. (a) Improvement over random sampling of two strategies: sampling more of the same CDS and sampling more different CDSs. Sampling more of the same CDS performs worse than random sampling. Sampling more different CDSs performs better than random sampling, especially at low sampling rates of 0.1%-10%. This result motivates us to select more samples with different CDSs. (b) We compare the CDS distribution of coresets (1% of CIFAR-10) selected by the baseline methods and by our improved counterparts. It shows that previous methods tend to choose a few particular CDSs, which could lead the trained model to perform worse than random sampling. Integrating our proposed constraint explicitly increases the diversity of CDSs in the selected coreset.
  • Figure 5: Performance improvement over baselines. We improve current methods with our proposed CDS metric and constraint. We compare the improved versions with their respective baselines on CIFAR-10 (a–c) and TinyImageNet (d–f) under the class-balanced sampling setting.
  • ...and 6 more figures
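
The CDS computation summarized in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are assumed to have already been reduced to $k$ dimensions (the PCA step is omitted), the per-dimension comparison is assumed to be an absolute difference tested against $\beta$, and the relationship matrix is assumed to hold 1 where two samples share the same CDS and 0 otherwise. All function names are hypothetical.

```python
def central_feature(features):
    """Per-dimension mean of the (already dimension-reduced) feature vectors."""
    n = len(features)
    return [sum(dim) / n for dim in zip(*features)]

def cds(feature, center, beta):
    """Binary code per sample: 1 where |feature_i - center_i| > beta (assumed rule)."""
    return tuple(int(abs(f - m) > beta) for f, m in zip(feature, center))

def cds_relationship_matrix(features, beta):
    """Return each sample's CDS code and a matrix marking pairs with identical codes."""
    center = central_feature(features)
    codes = [cds(f, center, beta) for f in features]
    # Assumed encoding: 1 if two samples share the same CDS, 0 otherwise.
    relation = [[int(ci == cj) for cj in codes] for ci in codes]
    return codes, relation

# Toy 2-D features standing in for PCA-reduced deep features (k = 2).
features = [
    [1.0, 0.0], [-1.0, 0.0],   # deviate from the center only in dimension 0
    [0.0, 1.0], [0.0, -1.0],   # deviate only in dimension 1
    [1.0, 1.0], [-1.0, -1.0],  # deviate in both dimensions
]
codes, R = cds_relationship_matrix(features, beta=0.5)

for code, row in zip(codes, R):
    print(code, row)
```

In this toy example the six samples fall into three CDS groups. A selection strategy that favors diversity, as motivated by the Figure 4 caption, would aim to draw samples from all three groups rather than from one.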