Bayesian Active Learning for Classification and Preference Learning

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, Máté Lengyel

TL;DR

The paper introduces BALD, a Bayesian information-theoretic active learning objective, and shows how to apply it with Gaussian Process classifiers by reformulating entropy-based gains in the output space. It derives analytic, near-exact expressions for the BALD criterion under probit likelihoods and demonstrates how to extend the approach to GP-based preference learning via a difference kernel. Empirical results on classification and preference tasks indicate BALD often outperforms other active-learning methods while maintaining low computational complexity, and the method remains agnostic to the underlying approximate inference technique. The work also discusses hyperparameter learning within BALD and situates the approach relative to related methodologies, highlighting practical advantages for nonparametric models.

Abstract

Information theoretic active learning has been widely studied for probabilistic models. For simple regression an optimal myopic policy is easily tractable. However, for other tasks and with more complex models, such as classification with nonparametric models, the optimal solution is harder to compute. Current approaches make approximations to achieve tractability. We propose an approach that expresses information gain in terms of predictive entropies, and apply this method to the Gaussian Process Classifier (GPC). Our approach makes minimal approximations to the full information theoretic objective. Our experimental performance compares favourably to many popular active learning algorithms, and has equal or lower computational complexity. We compare well to decision theoretic approaches also, which are privy to more information and require much more computational time. Secondly, by developing further a reformulation of binary preference learning to a classification problem, we extend our algorithm to Gaussian Process preference learning.
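As a worked equation for the phrase "expresses information gain in terms of predictive entropies": the display below is a sketch of the standard mutual-information identity that the BALD objective rests on. The notation ($\theta$ for model parameters, $\mathcal{D}$ for the data observed so far, $x$ for a candidate query, $y$ for its unknown label) is assumed here and does not appear in this summary.

```latex
\begin{align}
I[\theta; y \mid x, \mathcal{D}]
  &= H[\theta \mid \mathcal{D}]
     - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}
       \big[ H[\theta \mid y, x, \mathcal{D}] \big] \\
  &= H[y \mid x, \mathcal{D}]
     - \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}
       \big[ H[y \mid x, \theta] \big]
\end{align}
```

The first line measures entropy over parameters, which is awkward for nonparametric models such as GPs. The second line is the same quantity written with entropies over the output $y$ only: the query is informative when the overall prediction is uncertain (first term large) while individual parameter settings are each confident (second term small).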

Paper Structure

This paper contains 15 sections, 13 equations, 5 figures.

Figures (5)

  • Figure 1: Analytic approximation ($\stackrel{1}{\approx}$) to the binary entropy of the error function by a squared exponential. The absolute error remains under $3\cdot 10^{-3}$.
  • Figure 2: Percentage approximation error ($\pm$1 s.d.) for different methods of approximate inference (columns) and approximation methods for evaluating the mean entropy equation, labeled eqn:mean_entropy in the source (rows). The results indicate that $\stackrel{2}{\approx}$ is a very accurate approximation; EP causes some loss and Laplace significantly more, which is in line with the comparison presented in Kuss05. For our experiments we use EP.
  • Figure 3: Top: Evaluation on artificial datasets. Exemplars of the two classes are shown with black squares and red circles. Bottom: Results of active learning with nine methods: random query, BALD, MES, QBC with the vote criterion with 2 and 100 committee members, active SVM, IVM, decision theoretic (kapoor2007 and zhu2003), and empirical error.
  • Figure 4: Test set classification accuracy on classification and preference learning datasets. Methods used are BALD, random query, MES, QBC with 2 ($\hbox{QBC}_2$) and 100 ($\hbox{QBC}_{100}$) committee members, active SVM, IVM, decision theoretic kapoor2007, decision theoretic zhu2003, and empirical error. The decision theoretic methods took a long time to run, so they were not completed for all datasets. Plots (a-i) are GPC datasets, (j-l) are preference learning.
  • Figure 5: Summary of results for all classification experiments. $y$-axis denotes the number of additional data points, relative to BALD, required to achieve at least $97.5\%$ of the predictive performance of the entire pool. The 'box' denotes 25th to 75th percentile, the red line denotes the median over datasets, and the 'whiskers' depict the range. The crosses denote outliers ($>2.7\sigma$ from the mean). Positive values mean that the algorithm required more data points than BALD to achieve the same performance.
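
The squared-exponential approximation described in the Figure 1 caption can be checked numerically. The sketch below rests on assumptions that go beyond this summary: the function names are hypothetical, entropies are measured in bits, and the closed-form score and the constant $C = \sqrt{\pi \ln 2 / 2}$ are written as we recall them from the full paper, so they should be verified against it. Here `mu` and `var` stand for the GP posterior mean and variance of the latent function at a candidate point.

```python
import math

# Assumed constant: h(Phi(x)) ~ exp(-x^2 / (pi ln 2)) = exp(-x^2 / (2 C^2)),
# which gives C^2 = pi * ln(2) / 2.
C = math.sqrt(math.pi * math.log(2.0) / 2.0)


def probit(x):
    """Standard normal CDF Phi(x), computed via erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def binary_entropy_bits(p, q):
    """Entropy in bits of a Bernoulli variable with probabilities p and q = 1 - p."""
    # Terms with zero probability contribute nothing (limit of t * log t is 0).
    return -sum(t * math.log2(t) for t in (p, q) if t > 0.0)


def entropy_of_probit(x):
    """Binary entropy of Phi(x); Phi(-x) supplies 1 - Phi(x) without cancellation."""
    return binary_entropy_bits(probit(x), probit(-x))


def squared_exp_approx(x):
    """Squared-exponential approximation to entropy_of_probit(x)."""
    return math.exp(-x * x / (math.pi * math.log(2.0)))


def max_abs_error(lo=-5.0, hi=5.0, n=2001):
    """Largest absolute gap between the entropy and its approximation on a grid."""
    step = (hi - lo) / (n - 1)
    grid = (lo + i * step for i in range(n))
    return max(abs(entropy_of_probit(x) - squared_exp_approx(x)) for x in grid)


def bald_score(mu, var):
    """Approximate BALD score in bits for a latent value f ~ N(mu, var)."""
    # Entropy of the marginal prediction under a probit likelihood.
    predictive_entropy = entropy_of_probit(mu / math.sqrt(var + 1.0))
    # Gaussian average of the squared-exponential entropy approximation.
    expected_entropy = (
        C * math.exp(-mu * mu / (2.0 * (var + C * C))) / math.sqrt(var + C * C)
    )
    return predictive_entropy - expected_entropy


if __name__ == "__main__":
    print("max |error| on [-5, 5]:", max_abs_error())
    for mu, var in [(0.0, 0.0), (0.0, 1.0), (0.0, 4.0), (3.0, 0.1)]:
        print(f"mu={mu}, var={var}: score={bald_score(mu, var):.4f}")
```

Under these assumptions, a point with zero posterior variance scores essentially zero even when its prediction sits at the decision boundary, while the score grows with posterior variance at a fixed mean. This is the behaviour that separates this kind of criterion from plain maximum-entropy sampling, which would rank the boundary point highly regardless of how certain the model is about the latent function there.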