Document Set Expansion with Positive-Unlabeled Learning: A Density Estimation-based Approach

Haiyang Zhang; Qiuyi Chen; Yuanjie Zou; Yushan Pan; Jia Wang; Mark Stevenson

TL;DR

This work tackles Document Set Expansion (DSE) by reframing PU learning as density estimation, removing reliance on the SCAR assumption and on prior knowledge of the class balance. The proposed puDE framework uses two density estimators to model $p(x) = P(X|Y=+1)$ and $q(x) = P(X)$, and obtains $P(Y=+1|X)$ from the density ratio $f(x) = \frac{p(x)\pi}{q(x)}$, where $\pi$ denotes $P(Y=+1)$. It is implemented with nonparametric KDE (augmented by VAE dimensionality reduction) and with parametric Energy-Based Models (EBMs) trained via MCMC. Experiments on PubMed DSE datasets and a Covid study dataset show that puDE variants outperform transductive nnPU and BM25 baselines, highlighting robustness to label sparsity under transductive evaluation. The methods advance practical DSE by enabling effective topic-driven document screening without requiring SCAR or known priors, supporting scalable literature curation and similar tasks.
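The density-ratio idea in the summary above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hand-written isotropic Gaussian KDE on toy 2-D points (no VAE, no EBM), fits $q(x)$ on the unlabeled set only, and treats `pi` as a placeholder scaling constant, since how puDE handles $\pi$ is not shown here. All function names, the bandwidth, and the synthetic data are hypothetical.

```python
import numpy as np


def gaussian_kde_logpdf(train, query, bandwidth):
    """Log-density of an isotropic Gaussian KDE fitted on `train`, evaluated at `query`."""
    train = np.asarray(train, dtype=float)
    query = np.asarray(query, dtype=float)
    n, d = train.shape
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    sq_dist = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -0.5 * sq_dist / bandwidth**2
    # Numerically stable log-sum-exp over the training points.
    row_max = log_kernel.max(axis=1, keepdims=True)
    log_sum = row_max[:, 0] + np.log(np.exp(log_kernel - row_max).sum(axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    return log_sum - log_norm


def density_ratio_scores(labeled_pos, unlabeled, pi=1.0, bandwidth=0.5):
    """Score each unlabeled point with f(x) = pi * p(x) / q(x).

    p(x) is estimated from the labeled positives; q(x) is estimated from the
    unlabeled set (an assumption of this sketch). `pi` is a placeholder
    constant: it rescales the scores but does not change their ranking.
    """
    log_p = gaussian_kde_logpdf(labeled_pos, unlabeled, bandwidth)
    log_q = gaussian_kde_logpdf(unlabeled, unlabeled, bandwidth)
    return pi * np.exp(log_p - log_q)


# Synthetic toy data (hypothetical): positives near (2, 2), negatives near (-2, -2).
rng = np.random.default_rng(0)
labeled_pos = rng.normal(loc=2.0, scale=0.5, size=(30, 2))
hidden_pos = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
negatives = rng.normal(loc=-2.0, scale=0.5, size=(150, 2))
unlabeled = np.vstack([hidden_pos, negatives])  # first 50 rows are hidden positives

scores = density_ratio_scores(labeled_pos, unlabeled)
ranking = np.argsort(-scores)  # indices of unlabeled points, most likely positive first
```

Because `pi` only rescales the scores, the ranking of unlabeled documents is the same for any positive value; turning scores into hard labels would additionally require a threshold, which this sketch leaves open.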

Abstract

Document set expansion (DSE) aims to identify relevant documents in a large collection based on a small set of documents on a fine-grained topic. Previous work shows that PU learning is a promising method for this task. However, some serious issues remain unresolved, namely the typical challenges that PU methods face, such as an unknown class prior and imbalanced data, and the need for transductive experimental settings. In this paper, we propose a novel PU learning framework based on density estimation, called puDE, that can handle these issues. The advantage of puDE is that it is neither constrained by the SCAR assumption nor requires any class prior knowledge. We demonstrate the effectiveness of the proposed method on a series of real-world datasets and conclude that our method is a better alternative for the DSE task.

Paper Structure (11 sections, 8 equations, 1 figure, 2 tables)

Introduction
Preliminary
Proposed Methods
Task Formulation
PU Learning with Density Estimation
Nonparametric Density Estimation
Parametric Density Estimation
Experiment
Settings
Results
Conclusion

Figures (1)

Figure 1: F1 comparison on the Covid dataset as the ratio |LP|/|U| ranges from 0.01 to 1.
