Meta-Instance Selection. Instance Selection as a Classification Problem with Meta-Features

Marcin Blachnik; Piotr Ciepliński

TL;DR

This work tackles data pruning for nearest-neighbor classifiers by reframing instance selection as binary classification in a unified meta-feature space derived from the nearest-neighbor graph. The meta-classifier is trained on multiple datasets, using labels produced by established reference methods. The core contribution is the MetaIS framework, which builds a meta-space from NN graph descriptors, aggregates meta-datasets across datasets, and trains a meta-classifier (Balanced Random Forest is recommended) to predict whether a sample should be removed, enabling fast single-pass selection on new data. The results show that MetaIS can match or exceed the accuracy of several reference methods at similar compression levels while running substantially faster, and that the approach generalizes across datasets with a scalable training process. In practice, it provides a data-driven, threshold-controlled way to reduce training data and accelerate prediction without repeated iterative pruning, and the Balanced Random Forest meta-classifier performs robustly on imbalanced meta-data.

Abstract

Data pruning, or instance selection, is an important problem in machine learning, especially for the nearest neighbor classifier. However, although data pruning speeds up the prediction phase, the speed and efficiency of the pruning process itself remain an issue. In response, this study proposes transforming instance selection into a classification task conducted in a unified meta-feature space, where each instance is classified as either "to keep" or "to remove". This approach requires training an appropriate meta-classifier, which can be developed from historical instance selection results on other datasets, using reference instance selection methods as a labeling tool. The meta-feature space is constructed from properties extracted from the nearest neighbor graph. Experiments conducted on 17 datasets of varying sizes and five reference instance selection methods (ENN, Drop3, ICF, HMN-EI, and CCIS) demonstrate that the proposed solution achieves results comparable to the reference instance selection methods while significantly reducing computational complexity. In the proposed approach, the computational complexity of the system depends only on identifying the k-nearest neighbors of each data sample and running the meta-classifier. Additionally, the study discusses the choice of meta-classifier and recommends Balanced Random Forest.
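
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The five descriptors in `meta_features`, the helper names, the toy data, and the parameter values are all assumptions made for illustration. ENN serves as the reference labeling method because it is the simplest of the five, and scikit-learn's `RandomForestClassifier` with `class_weight="balanced_subsample"` stands in for the recommended Balanced Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors


def knn_graph(X, k):
    """Distances and indices of the k nearest neighbors of each sample.

    Calling kneighbors() without arguments excludes each point from its
    own neighbor list.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    return nn.kneighbors()


def meta_features(X, y, k=5):
    """Illustrative (assumed) descriptors of the nearest neighbor graph."""
    dist, idx = knn_graph(X, k)
    dist = dist / dist.mean()            # assumed scaling so datasets are comparable
    same = y[idx] == y[:, None]          # True where a neighbor shares the class
    fallback = dist[:, -1]               # used when no such neighbor lies within k
    d_same = np.where(same, dist, np.inf).min(axis=1)
    d_enemy = np.where(~same, dist, np.inf).min(axis=1)
    d_same = np.where(np.isfinite(d_same), d_same, fallback)
    d_enemy = np.where(np.isfinite(d_enemy), d_enemy, fallback)
    in_degree = np.bincount(idx.ravel(), minlength=len(X)) / k
    return np.column_stack(
        [same.mean(axis=1), dist.mean(axis=1), d_same, d_enemy, in_degree]
    )


def enn_remove_labels(X, y, k=3):
    """ENN as the reference method: 1 = remove, 0 = keep.

    A sample is flagged when its class differs from the majority class of
    its k nearest neighbors. Labels must be non-negative integers.
    """
    _, idx = knn_graph(X, k)
    majority = np.array([np.bincount(row).argmax() for row in y[idx]])
    return (majority != y).astype(int)


def train_meta_classifier(datasets, k=5):
    """Stack the meta-datasets of several datasets and fit one classifier."""
    M = np.vstack([meta_features(X, y, k) for X, y in datasets])
    r = np.concatenate([enn_remove_labels(X, y) for X, y in datasets])
    clf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced_subsample", random_state=0
    )
    return clf.fit(M, r)


def select(clf, X, y, threshold=0.5, k=5):
    """Single-pass selection: drop samples with P(remove) >= threshold."""
    col = list(clf.classes_).index(1)
    p_remove = clf.predict_proba(meta_features(X, y, k))[:, col]
    keep = p_remove < threshold
    return X[keep], y[keep]


def toy_dataset(seed, n=200):
    """Two overlapping Gaussian classes (hypothetical data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
    return X, y


clf = train_meta_classifier([toy_dataset(seed) for seed in range(3)])
X_new, y_new = toy_dataset(99)
X_kept, y_kept = select(clf, X_new, y_new, threshold=0.5)
```

On a new dataset, only the neighbor search and one call to the meta-classifier are needed, which matches the complexity argument in the abstract. Lowering `threshold` removes more samples, so it acts as the compression control.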

Paper Structure (18 sections, 1 equation, 8 figures, 6 tables)

Introduction
Related Work
Meta-Classifier-Based Instance Selection
Meta-Descriptors of the Nearest Neighbor Graph
Computational Complexity Analysis
System and Experimental Design
Reference methods
The Procedure of Assessing the Quality of Instance Selection
Datasets Used in the Experiments
Performance Metrics
Implementation Details
Results
Performance Comparison
Execution Time
Assessment of the Meta-Classifier Performance
...and 3 more sections

Figures (8)

Figure 1: The concept of data processing in the proposed algorithm.
Figure 2: The concept of a single dataset processing in the proposed algorithm. The labelling (marked in yellow) is applied only during the training meta-set preparation.
Figure 3: Graphical representation of the meta-classifier training and application. The green color indicates the procedure for creating the meta-training set, the blue color indicates the procedure for training the meta-classifier, and the orange color indicates the procedure for applying the classifier.
Figure 4: Two types of performance measures used in the experiments. The performance is measured as the area under the accuracy-compression curve. The first measure is limited by the reduction rate of the reference instance selection model. The second measure is not limited by the reduction rate of the reference instance selection model; it is bounded by the last value of the reduction rate. In the figures, the area is hatched.
Figure 5: Comparison of the reference instance selection methods (marked as X) with the meta instance selection (marked with a curve) in terms of prediction performance and reduction rate. Colors correspond to a particular reference instance selection method and its corresponding meta-instance selection.
...and 3 more figures
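
The two performance measures in the Figure 4 caption can be sketched as a trapezoidal area under an accuracy-compression curve. This is an illustrative sketch only. The function name, the linear interpolation at the cut-off, and the absence of any normalization are assumptions, since the caption does not specify them, and the numbers in the usage lines are made up.

```python
import numpy as np


def area_under_curve(compression, accuracy, limit=None):
    """Trapezoidal area under an accuracy-compression curve.

    With limit=None the area extends to the last measured compression
    value (the second measure in the caption). With limit set to the
    reduction rate of a reference method, the curve is cut at that point
    (the first measure), using linear interpolation for the end point.
    """
    c = np.asarray(compression, dtype=float)
    a = np.asarray(accuracy, dtype=float)
    order = np.argsort(c)
    c, a = c[order], a[order]
    if limit is not None and limit < c[-1]:
        a_limit = np.interp(limit, c, a)
        mask = c < limit
        c = np.append(c[mask], limit)
        a = np.append(a[mask], a_limit)
    return float(np.sum(np.diff(c) * (a[1:] + a[:-1]) / 2.0))


# Hypothetical curve: accuracy stays flat, then drops at high compression.
compression = [0.0, 0.5, 1.0]
accuracy = [0.9, 0.9, 0.5]
full_area = area_under_curve(compression, accuracy)
limited_area = area_under_curve(compression, accuracy, limit=0.75)
```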
