Feature Weighting Improves Pool-Based Sequential Active Learning for Regression

Dongrui Wu

Abstract

Pool-based sequential active learning for regression (ALR) optimally selects a small number of samples sequentially from a large pool of unlabeled samples to label, so that a more accurate regression model can be constructed under a given labeling budget. Representativeness and diversity, which involve computing the distances among different samples, are important considerations in ALR. However, previous ALR approaches do not incorporate the importance of different features in inter-sample distance computation, resulting in sub-optimal sample selection. This paper proposes three feature weighted single-task ALR approaches and two feature weighted multi-task ALR approaches, where the ridge regression coefficients trained from a small number of previously labeled samples are used to weight the corresponding features in inter-sample distance computation. Experiments showed that this easy-to-implement enhancement almost always improves the performance of four existing ALR approaches, in both single-task and multi-task regression problems. The feature weighting strategy may also be easily extended to stream-based ALR and to classification problems.
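The core idea in the abstract can be sketched in a few lines: fit ridge regression on the small labeled set, then use the absolute values of its coefficients as feature weights in the inter-sample distance. The snippet below is a minimal illustration, not the paper's exact procedure; the data, the regularization parameter `lam`, and the use of $|\beta_j|$ as the weight for feature $j$ are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 previously labeled samples with d = 5 features.
d = 5
X = rng.normal(size=(10, d))
true_beta = np.array([2.0, 0.1, 0.0, 1.5, 0.3])  # illustrative ground truth
y = X @ true_beta + 0.1 * rng.normal(size=10)

# Ridge regression, closed form: beta = (X'X + lam*I)^{-1} X'y
lam = 1.0  # assumed regularization parameter
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Weight each feature by the magnitude of its regression coefficient,
# so features that matter more for the output dominate the distance.
w = np.abs(beta)

def weighted_distance(a, b, w):
    """Feature-weighted Euclidean distance between samples a and b."""
    return np.sqrt(np.sum(w * (a - b) ** 2))
```

With this distance, a feature whose coefficient is near zero contributes almost nothing to representativeness and diversity computations, which is the intended effect of the proposed enhancement.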

Paper Structure

This paper contains 31 sections, 13 equations, 10 figures, 2 tables, 9 algorithms.

Figures (10)

  • Figure 1: Performance of the nine single-task ALR algorithms on the 11 datasets, averaged over 100 runs. (a) Yacht; (b) autoMPG; (c) NO2; (d) PM10; (e) Housing; (f) CPS; (g) EE-Cooling; (h) Concrete; (i) Airfoil; (j) Wine-red; and, (k) Wine-white.
  • Figure 2: Normalized AUCs of the nine single-task ALR algorithms on the 11 datasets. (a) RMSE; and, (b) CC. The last group shows the average over the 11 datasets.
  • Figure 3: Performance of the seven single-task ALR algorithms, averaged over 100 runs. (a) Yacht; (b) autoMPG; and, (c) NO2. Two redundant features were introduced to each dataset.
  • Figure 4: Performance of the seven single-task ALR algorithms on the EE-Cooling dataset with $\lambda\in\{0.01,0.1,1,10\}$, averaged over 100 runs.
  • Figure 5: Effect of different number of initially labeled samples, $n_{\min}$ (averaged over 100 runs). (a) Yacht, where $d=6$; and, (b) autoMPG, where $d=9$.
  • ...and 5 more figures