Kernel Density Estimation for Multiclass Quantification
Alejandro Moreo, Pablo González, Juan José del Coz
TL;DR
This work addresses class prevalence estimation (quantification) under prior probability shift by replacing class-wise histograms with a multivariate kernel density estimation (KDE) representation of posterior probabilities on the unit simplex. It introduces KDEy, with variants KDEy-HD, KDEy-CS (closed-form), and KDEy-ML, which model the test distribution as a mixture of class-conditional densities estimated via KDE/GMM on the simplex, and which unify distribution matching and maximum-likelihood quantification in a single framework. Across the Tweets, UCI-multi, and LeQua-T1B datasets, KDEy variants consistently outperform histogram-based DM methods and often surpass EMQ, with KDEy-ML delivering the strongest overall performance and robustness. The results indicate that preserving inter-class correlations through a multivariate density on the posterior simplex yields substantial gains in multiclass prevalence estimation, and that the approach also applies to binary quantification.
Abstract
Several disciplines, such as the social sciences, epidemiology, sentiment analysis, and market research, are interested in knowing the distribution of the classes in a population rather than the individual labels of its members. Quantification is the supervised machine learning task concerned with obtaining accurate predictors of class prevalence, particularly in the presence of label shift. Distribution-matching (DM) approaches are one of the most important families of quantification methods proposed in the literature so far. Current DM approaches model the populations involved by means of histograms of posterior probabilities. In this paper, we argue that their application to the multiclass setting is suboptimal, since the histograms become class-specific and thus miss the opportunity to model inter-class information that may exist in the data. We propose a new representation mechanism based on multivariate densities that we model via kernel density estimation (KDE). Our experiments show that our method, dubbed KDEy, yields superior quantification performance with respect to previous DM approaches. We also investigate the KDE-based representation within the maximum likelihood framework and show that KDEy often outperforms the expectation-maximization method for quantification, arguably the strongest contender in the quantification arena to date.
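To make the maximum-likelihood idea concrete, the following is a minimal sketch and not the authors' implementation. It fits one Gaussian KDE per class on posterior vectors, then estimates the prevalence as the mixture weights that maximize the likelihood of the test posteriors. Everything specific here is an assumption made for illustration: the function names, the synthetic Dirichlet "posteriors" standing in for a real classifier's outputs, the bandwidth `h=0.1`, and the EM-style fixed-point update used to optimize the weights (the paper may rely on a different optimizer).

```python
import math
import random


def sample_posterior(rng, true_class, n_classes, peak=10.0, rest=2.0):
    """Draw a synthetic posterior vector (a point on the simplex) from a
    Dirichlet concentrated on the true class. Hypothetical toy data that
    stands in for the output of a probabilistic classifier."""
    g = [rng.gammavariate(peak if c == true_class else rest, 1.0)
         for c in range(n_classes)]
    total = sum(g)
    return [v / total for v in g]


def kde_density(x, refs, h):
    """Gaussian KDE with isotropic bandwidth h, evaluated at point x."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (-d / 2.0)
    acc = 0.0
    for r in refs:
        sq = sum((a - b) ** 2 for a, b in zip(x, r))
        acc += math.exp(-sq / (2.0 * h * h))
    return norm * acc / len(refs)


def kde_ml_prevalence(train_by_class, test, h=0.1, n_iter=200):
    """Estimate prevalence as the weights of a mixture of fixed
    class-conditional KDEs that maximize the test likelihood.
    The weights are updated with an EM-style fixed-point iteration
    (an assumption of this sketch, chosen for simplicity)."""
    n = len(train_by_class)
    # dens[j][i] = density of test point j under the KDE of class i
    dens = [[kde_density(x, train_by_class[i], h) for i in range(n)]
            for x in test]
    alpha = [1.0 / n] * n  # start from the uniform prevalence
    for _ in range(n_iter):
        new = [0.0] * n
        for row in dens:
            mix = sum(a * p for a, p in zip(alpha, row))
            for i in range(n):
                new[i] += alpha[i] * row[i] / mix  # responsibility of class i
        alpha = [v / len(dens) for v in new]
    return alpha


# Toy experiment: balanced training set, shifted test prevalence.
rng = random.Random(0)
n_classes = 3
train_by_class = [[sample_posterior(rng, c, n_classes) for _ in range(200)]
                  for c in range(n_classes)]
test_counts = [300, 150, 50]
true_prev = [c / sum(test_counts) for c in test_counts]
test = [sample_posterior(rng, c, n_classes)
        for c, k in enumerate(test_counts) for _ in range(k)]
est_prev = kde_ml_prevalence(train_by_class, test)
print("true:", true_prev)
print("estimated:", [round(p, 3) for p in est_prev])
```

Because each KDE is evaluated on the full posterior vector rather than on one coordinate at a time, the class-conditional densities retain the inter-class structure that per-class histograms discard. The distribution-matching variants would keep the same KDE representation but minimize a divergence between the mixture and the test density instead of maximizing the likelihood.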
