Document Set Expansion with Positive-Unlabeled Learning: A Density Estimation-based Approach
Haiyang Zhang, Qiuyi Chen, Yuanjie Zou, Yushan Pan, Jia Wang, Mark Stevenson
TL;DR
This work tackles Document Set Expansion (DSE) by reframing PU learning as a density estimation problem, removing reliance on the SCAR assumption and on prior knowledge of the class balance. The proposed puDE framework uses two density estimators to model $p(x) = P(X|Y=+1)$ and $q(x) = P(X)$, and obtains $P(Y=+1|X)$ from the density ratio $f(x) = \frac{p(x)\pi}{q(x)}$, where $\pi$ denotes the class prior $P(Y=+1)$. It is implemented both with nonparametric KDE (augmented by VAE dimensionality reduction) and with parametric Energy-Based Models (EBMs) trained via MCMC. Experimental results on PubMed DSE datasets and a Covid study dataset show that puDE variants outperform transductive nnPU baselines and BM25 baselines, highlighting robustness to label sparsity and to the transductive evaluation setting. The methods advance practical DSE by enabling effective topic-driven document screening without requiring SCAR or known priors, supporting scalable literature curation and similar tasks.
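The density-ratio idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes toy 1-D features in place of VAE-reduced document embeddings, a hand-written Gaussian KDE with a fixed bandwidth, and hypothetical helper names (`gaussian_kde`, `pu_scores`). Because $\pi$ is a constant factor, ranking documents by $p(x)/q(x)$ gives the same order as ranking by $f(x)$, so the sketch omits it.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

def pu_scores(positives, unlabeled, bandwidth=0.5, eps=1e-12):
    """Score each unlabeled point by the ratio p(x) / q(x).

    p is estimated from the labeled positives (stands in for P(X | Y=+1)),
    q from the unlabeled pool (stands in for P(X)). The class prior is
    left out because it only rescales the scores.
    """
    p = gaussian_kde(positives, bandwidth)
    q = gaussian_kde(unlabeled, bandwidth)
    return [p(x) / (q(x) + eps) for x in unlabeled]

# Toy data (an assumption for illustration): positives cluster near +2,
# negatives near -2. The first 50 unlabeled points are hidden positives.
rng = random.Random(0)
hidden_pos = [rng.gauss(2.0, 0.5) for _ in range(50)]
negatives = [rng.gauss(-2.0, 0.5) for _ in range(200)]
labeled_pos = [rng.gauss(2.0, 0.5) for _ in range(20)]
unlabeled = hidden_pos + negatives

scores = pu_scores(labeled_pos, unlabeled)

# Rank the unlabeled pool by score and count hidden positives in the top 50.
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
hits = sum(1 for i in ranking[:50] if i < 50)
print(f"hidden positives in top 50: {hits}")
```

In this toy setting the ratio is large where the labeled positives are dense relative to the pool and close to zero elsewhere, so the ranking needs no estimate of the class prior. Turning the scores into calibrated probabilities or a hard decision threshold would need additional choices that this sketch does not cover.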
Abstract
Document set expansion (DSE) aims to identify relevant documents in a large collection, given a small set of documents on a fine-grained topic. Previous work shows that PU learning is a promising method for this task. However, several serious issues remain unresolved, namely the typical challenges that PU methods suffer from, such as an unknown class prior and imbalanced data, and the need for transductive experimental settings. In this paper, we propose a novel PU learning framework based on density estimation, called puDE, that can handle these issues. The advantage of puDE is that it is neither constrained by the SCAR assumption nor requires any class prior knowledge. We demonstrate the effectiveness of the proposed method on a series of real-world datasets and conclude that our method is a better alternative for the DSE task.
