Kernel Density Estimation for Multiclass Quantification

Alejandro Moreo, Pablo González, Juan José del Coz

TL;DR

This work tackles the problem of quantifying class prevalence under prior probability shift by replacing histograms with a multivariate kernel density estimation (KDE) representation of posterior probabilities on the unit simplex. It introduces KDEy, with variants KDEy-HD, KDEy-CS (closed-form), and KDEy-ML, that model a mixture of class-conditional densities via KDE/GMM on the simplex and unify distribution matching and maximum-likelihood quantification. Across Tweets, UCI-multi, and LeQua-T1B datasets, KDEy variants consistently outperform histogram-based DM methods and often surpass EMQ, with KDEy-ML delivering the strongest overall performance and robustness. The approach demonstrates that preserving inter-class correlations through a multivariate density on the posterior simplex yields substantial gains in multiclass prevalence estimation and extends gracefully to binary quantification as well.

Abstract

Several disciplines, such as the social sciences, epidemiology, sentiment analysis, and market research, are interested in knowing the distribution of the classes in a population rather than the individual labels of its members. Quantification is the supervised machine learning task concerned with obtaining accurate predictors of class prevalence, particularly in the presence of label shift. Distribution-matching (DM) approaches are one of the most important families of quantification methods proposed in the literature so far. Current DM approaches model the populations involved by means of histograms of posterior probabilities. In this paper, we argue that their application to the multiclass setting is suboptimal, since the histograms become class-specific and thus miss the opportunity to model inter-class information that may exist in the data. We propose a new representation mechanism based on multivariate densities that we model via kernel density estimation (KDE). Our experiments show that our method, dubbed KDEy, yields superior quantification performance with respect to previous DM approaches. We also investigate the KDE-based representation within the maximum likelihood framework and show that KDEy often outperforms the expectation-maximization method for quantification, arguably the strongest contender in the quantification arena to date.
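To make the maximum-likelihood view concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It models each class with an isotropic Gaussian KDE over posterior-probability vectors and then searches for the mixture weights (the class prevalences) that maximise the likelihood of the test posteriors. The function names `kde_density` and `estimate_prevalence`, the synthetic Dirichlet-distributed "posteriors", the bandwidth of 0.1, and the simple fixed-point update on the weights are all illustrative choices; the paper may use a different optimiser.

```python
import numpy as np


def kde_density(points, refs, bandwidth):
    """Isotropic Gaussian KDE density of each row of `points`, with kernels centred on `refs`."""
    d = refs.shape[1]
    sq_dist = ((points[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (-d / 2.0)
    return norm * np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1)


def estimate_prevalence(class_posteriors, test_posteriors, bandwidth=0.1, n_iter=500):
    """Mixture weights that maximise the test likelihood under fixed class-wise KDEs."""
    # dens[m, i] = density of test point m under the KDE of class i
    dens = np.column_stack(
        [kde_density(test_posteriors, refs, bandwidth) for refs in class_posteriors]
    )
    dens = np.maximum(dens, 1e-300)  # guard against division by zero
    alpha = np.full(dens.shape[1], 1.0 / dens.shape[1])
    for _ in range(n_iter):
        # Fixed-point update for mixture weights with fixed components (assumed optimiser)
        resp = alpha * dens
        resp /= resp.sum(axis=1, keepdims=True)
        alpha = resp.mean(axis=0)
    return alpha


# Synthetic stand-in for classifier posteriors on a 3-class problem (assumption).
rng = np.random.default_rng(0)
concentration = 2.0 * np.ones((3, 3)) + 8.0 * np.eye(3)
train = [rng.dirichlet(concentration[i], size=300) for i in range(3)]

true_prev = np.array([0.6, 0.3, 0.1])
counts = [600, 300, 100]
test = np.vstack([rng.dirichlet(concentration[i], size=c) for i, c in enumerate(counts)])

est = estimate_prevalence(train, test)
print("true prevalence:     ", true_prev)
print("estimated prevalence:", np.round(est, 3))
```

Because each class density is defined over the whole posterior vector rather than one coordinate at a time, the sketch keeps the inter-class structure that class-wise histograms discard.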

Paper Structure (21 sections, 36 equations, 3 figures, 10 tables)

Figures (3)

  • Figure 1: DM in the binary case. Left: the distributions of the posterior probabilities of "being positive" are modelled for the positive (red) and negative (blue) examples in the training data. Right: the distribution of the posterior probabilities of "being positive" is modelled for the test examples (violet), and the parameter of a mixture of positives and negatives (green) yielding the best approximation of the test distribution is sought.
  • Figure 2: Problem representations obtained with traditional class-wise histograms (first row) and with our proposed mechanism based on GMMs (second row) on a 3-class sentiment problem (the dataset is called "wb" and belongs to the "Tweets" group described later in Section \ref{sec:datasets}). The first three columns (A, B, C) show the model representations for the training sets $L_1$, $L_2$, and $L_3$, while the last column (D) shows the representation for the test set $U$. The quantification problem is framed as the task of reconstructing (D) as a convex linear combination of (A, B, C).
  • Figure 3: Sensitivity of DM-HD (left) and KDEy-ML (right) to the hyperparameters, number of bins ("nbins") and bandwidth, respectively.
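The binary setup described in the caption of Figure 1 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the method as implemented in the paper. The posterior probabilities of "being positive" are summarised with histograms for training positives, training negatives, and test examples, and the mixture parameter is chosen to minimise the Hellinger distance between the mixture and the test histogram. The function names, the Beta-distributed synthetic scores, the 10 bins, and the grid search over 101 candidate values are all assumptions made for the example.

```python
import numpy as np


def histogram(scores, n_bins):
    """Normalised histogram of scores in [0, 1]."""
    counts, _ = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def dm_hd_binary(pos_scores, neg_scores, test_scores, n_bins=10, grid=101):
    """Positive-class prevalence whose mixture histogram best matches the test histogram."""
    h_pos = histogram(pos_scores, n_bins)
    h_neg = histogram(neg_scores, n_bins)
    h_test = histogram(test_scores, n_bins)
    candidates = np.linspace(0.0, 1.0, grid)
    dists = [hellinger(a * h_pos + (1.0 - a) * h_neg, h_test) for a in candidates]
    return float(candidates[int(np.argmin(dists))])


# Synthetic posterior probabilities of "being positive" (assumption).
rng = np.random.default_rng(0)
pos_train = rng.beta(5, 2, size=1000)
neg_train = rng.beta(2, 5, size=1000)
test_scores = np.concatenate([rng.beta(5, 2, size=700), rng.beta(2, 5, size=300)])

est_pos_prev = dm_hd_binary(pos_train, neg_train, test_scores)
print("estimated positive prevalence:", est_pos_prev)
```

The `n_bins` argument plays the role of the "nbins" hyperparameter whose sensitivity is shown for DM-HD in Figure 3.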