Automated Research Article Classification and Recommendation Using NLP and ML
Shadikur Rahman, Hasibul Karim Shanto, Umme Ayman Koana, Syed Muhammad Danish
TL;DR
The paper tackles information overload in scholarly literature by proposing an automated NLP/ML framework that jointly classifies articles and recommends related work. It benchmarks multiple text representations (TF-IDF, Count Vectorizer, Sentence-BERT, USE, Mirror-BERT) across several classifiers (LR, SVM, NB, RF, GBRT, kNN) on a large arXiv corpus, finding that Logistic Regression with TF-IDF achieves the best accuracy at 0.69. A cosine-similarity–based recommender operates on vectorized article representations to retrieve related papers, enabling scalable literature discovery. The work demonstrates a data-driven, end-to-end approach that improves efficiency in navigating vast digital libraries and provides a foundation for future enhancements with larger datasets and transformer-based models.
Abstract
In the digital era, the exponential growth of scientific publications has made it increasingly difficult for researchers to efficiently identify and access relevant work. This paper presents an automated framework for research article classification and recommendation that leverages Natural Language Processing (NLP) techniques and machine learning. Using a large-scale arXiv.org dataset spanning more than three decades, we evaluate multiple feature extraction approaches (TF-IDF, Count Vectorizer, Sentence-BERT, USE, Mirror-BERT) in combination with diverse machine learning classifiers (Logistic Regression, SVM, Naïve Bayes, Random Forest, Gradient Boosted Trees, and k-Nearest Neighbour). Our experiments show that Logistic Regression with TF-IDF consistently yields the best classification performance, achieving an accuracy of 69%. To complement classification, we incorporate a recommendation module based on the cosine similarity of vectorized articles, enabling efficient retrieval of related research papers. The proposed system directly addresses the challenge of information overload in digital libraries and demonstrates a scalable, data-driven solution to support literature discovery.
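The recommendation step described above (cosine similarity over vectorized articles) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state their tooling, so the example below builds TF-IDF vectors from scratch using only the Python standard library. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `recommend`), the smoothed IDF formula, the whitespace tokenizer, and the four toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, keeping alphabetic tokens only.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_index, vectors, top_k=2):
    """Indices of the top_k documents most similar to the query document."""
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    # Highest similarity first; ties broken by lower index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:top_k]]


# Toy stand-ins for article abstracts (hypothetical data, not from arXiv).
abstracts = [
    "deep neural networks for image classification",
    "convolutional neural networks improve image recognition",
    "galaxy formation in the early universe",
    "dark matter and galaxy clusters in the universe",
]
vectors = tfidf_vectors(abstracts)
print(recommend(0, vectors, top_k=1))
print(recommend(2, vectors, top_k=1))
```

In this toy corpus the two machine-learning abstracts share terms with each other and the two astrophysics abstracts share terms with each other, so each query retrieves its topical neighbour. On a real corpus, the same ranking logic would presumably run over whichever representation is chosen (TF-IDF or a sentence embedding), typically through an optimized library rather than hand-written dictionaries.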
