Data structure > labels? Unsupervised heuristics for SVM hyperparameter estimation

Michał Cholewa, Michał Romaszewski, Przemysław Głomb

TL;DR

This work tackles the computational burden of tuning SVM hyperparameters by evaluating unsupervised heuristics that estimate $C$ and $\gamma$ from data without labels. It surveys multiple existing heuristics (Smola, Chapelle & Zien, Jaakkola & Soares, Gelbart, covtrace) for SVMs with Gaussian RBF kernels and introduces a new extension (MC) for the $C$ parameter. Through experiments on 31 KEEL datasets and a Bayesian analysis with a 1% region of practical equivalence (ROPE), the authors show that covtrace+MC often matches GSCV in accuracy while running 100–200× faster, and that several heuristics achieve practical equivalence on many datasets. The findings suggest that unsupervised SVM calibration is viable for rapid deployment on resource-constrained platforms and in limited-label settings, though performance can degrade when clustering assumptions do not hold.

Abstract

Classification is one of the main areas of pattern recognition research, and within it, the Support Vector Machine (SVM) is one of the most popular methods outside the field of deep learning, and a de facto reference for many machine learning approaches. Its performance is determined by parameter selection, which is usually achieved by a time-consuming grid search cross-validation procedure (GSCV). That method, however, relies on the availability and quality of labelled examples and can thus be hindered when those are limited. To address that problem, there exist several unsupervised heuristics that select parameters from the characteristics of the dataset instead of class label information. While an order of magnitude faster, they are scarcely used, under the assumption that their results are significantly worse than those of grid search. To challenge that assumption, we propose an improved heuristic for SVM parameter selection and test it against GSCV and state-of-the-art heuristics on over 30 standard classification datasets. The results show not only its advantage over state-of-the-art heuristics but also that it is statistically no worse than GSCV.
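
To make the idea of label-free parameter selection concrete, the following is a minimal sketch of one simple rule for the RBF kernel width. It is not one of the heuristics evaluated in the paper: it uses the variance-based rule that scikit-learn applies for `gamma="scale"`, namely $\gamma = 1 / (d \cdot \mathrm{Var}(X))$ for $d$ features. The function names, the synthetic data and the kernel parameterisation $\exp(-\gamma \lVert x - y \rVert^2)$ are assumptions made for illustration.

```python
import numpy as np


def variance_scaled_gamma(X):
    """Label-free RBF gamma: 1 / (n_features * variance over all entries of X).

    Illustrative stand-in only; this is not a heuristic from the paper.
    """
    X = np.asarray(X, dtype=float)
    return 1.0 / (X.shape[1] * X.var())


def rbf_kernel(X, gamma):
    """Gaussian RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # synthetic, unlabelled data (assumption)
gamma = variance_scaled_gamma(X)   # no class labels are used anywhere
K = rbf_kernel(X, gamma)
```

Because the rule depends only on the spread of the data, rescaling every feature by a factor of 2 divides the estimated $\gamma$ by 4, which keeps the kernel values unchanged.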

Paper Structure

This paper contains 23 sections, 19 equations, 3 figures.

Figures (3)

  • Figure 1: Example SVM behaviour on the first two features of the 'Breast Cancer Wisconsin Dataset' (wdbc). Red crosses and blue circles mark data points from the two classes. The solid line is the decision boundary; dashed lines denote the margins. The cases shown illustrate the influence of the $C$ and $\gamma$ parameters, for both good and bad values.
  • Figure 2: The impact of SVM parameters on accuracy. Parameter values are shown on a logarithmic scale. Accuracy values were obtained with 5-fold CV for each pair of parameters from a $50\times50$ parameter grid. The highest accuracy is denoted as 'best'. Marked points denote the results of the unsupervised heuristics from this paper, with the five highest-scoring heuristics marked in colour.
  • Figure 3: Visualisation of the Bayesian analysis of results, following the methodology of Benavoli et al. (2017), for selected cases from Table \ref{tab:results_bayes_aa}: covtrace+default, covtrace+Chapelle, covtrace+MC. Vertices of the simplex represent decisions with certainty in favour of: CV (lower left), the example heuristic (lower right) and ROPE (top); the latter corresponds to practical equivalence of CV and heuristic accuracy. Points represent Monte Carlo samples of posterior probabilities in barycentric coordinates. BA denotes balanced accuracy, OA denotes overall accuracy. Note that the better the $C$ heuristic, the closer UH-SVM and GSCV-SVM are to equivalence. Our proposed extension (MC) provides the best results. The tendency visible in this plot is similar across other well-performing $\gamma$ heuristics.
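
The cost of the grid search described for Figure 2 can be illustrated with a short sketch that builds a $50\times50$ log-spaced grid and counts the model fits needed under 5-fold CV. The grid size and fold count come from the caption; the exponent range $10^{-3}$ to $10^{3}$ and all variable names are assumptions chosen for illustration.

```python
import numpy as np

n_grid = 50    # grid points per parameter, as in the Figure 2 caption
n_folds = 5    # 5-fold cross-validation, as in the Figure 2 caption

# Assumed search range; the caption only states that the grid is log-spaced.
C_values = np.logspace(-3, 3, n_grid)
gamma_values = np.logspace(-3, 3, n_grid)

# Every (C, gamma) pair is evaluated once per fold.
grid = [(C, g) for C in C_values for g in gamma_values]
n_fits = len(grid) * n_folds

print(n_fits)  # 12500
```

A label-free heuristic instead proposes a single $(C, \gamma)$ pair from the data, so under these assumptions it replaces thousands of SVM fits with one, plus the cost of computing the heuristic itself.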