NLP-Powered Repository and Search Engine for Academic Papers: A Case Study on Cyber Risk Literature with CyLit

Linfeng Zhang, Changyue Hu, Zhiyu Quan

TL;DR

CyLit addresses the difficulty of locating and contextualizing cyber risk literature in a rapidly growing field by providing a living, NLP-powered repository and search tool. Its pipeline collects records from Scopus, extracts keywords with KeyBERT, computes Sentence-BERT embeddings, groups keywords with K-means clustering, and relates the resulting clusters through Apriori association analysis; the results are exposed through a Django/React web interface with FAISS-based semantic search. The system derives 30 keyword clusters from a large domain keyword library and supports trend visualization and cross-topic insights, offering a perspective complementary to manual reviews and general LLM-based approaches. In comparisons with human judgments and ChatGPT experiments, CyLit demonstrates scalable, domain-tuned categorization and living-literature capabilities that can accelerate actuarial and cyber risk research. Future directions include LLM fine-tuning and deployment to other domains.

Abstract

As the body of academic literature continues to grow, researchers face increasing difficulties in effectively searching for relevant resources. Existing databases and search engines often fall short of providing a comprehensive and contextually relevant collection of academic literature. To address this issue, we propose a novel framework that leverages Natural Language Processing (NLP) techniques. This framework automates the retrieval, summarization, and clustering of academic literature within a specific research domain. To demonstrate the effectiveness of our approach, we introduce CyLit, an NLP-powered repository specifically designed for the cyber risk literature. CyLit empowers researchers by providing access to context-specific resources and enabling the tracking of trends in the dynamic and rapidly evolving field of cyber risk. Through the automatic processing of large volumes of data, our NLP-powered solution significantly enhances the efficiency and specificity of academic literature searches. We compare the literature categorization results of CyLit to those presented in survey papers or generated by ChatGPT, highlighting the distinctive insights this tool provides into cyber risk research literature. Using NLP techniques, we aim to revolutionize the way researchers discover, analyze, and utilize academic resources, ultimately fostering advancements in various domains of knowledge.
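
To make the association-analysis step named in the TL;DR more concrete, the following is a minimal sketch of Apriori-style mining over keyword clusters. It is not the authors' implementation: the cluster labels, the toy papers, the `min_support` value, and the function names `frequent_itemsets` and `association_rules` are all hypothetical, and only itemsets of size one and two are mined.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for frequent itemsets of size 1 and 2."""
    n = len(transactions)
    # Level 1: support of each individual cluster label.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {
        frozenset([item]): count / n
        for item, count in item_counts.items()
        if count / n >= min_support
    }
    # Level 2: Apriori pruning, so pairs are built only from frequent single items.
    kept = {next(iter(s)) for s in frequent}
    pair_counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t) & kept), 2):
            pair_counts[frozenset(pair)] += 1
    frequent.update(
        {pair: count / n for pair, count in pair_counts.items() if count / n >= min_support}
    )
    return frequent


def association_rules(frequent):
    """Return (antecedent, consequent, confidence) for every frequent pair."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        # confidence(X -> Y) = support(X and Y) / support(X)
        rules.append((a, b, support / frequent[frozenset([a])]))
        rules.append((b, a, support / frequent[frozenset([b])]))
    return rules


# Hypothetical input: each paper is represented by the set of keyword
# clusters its extracted keywords fall into (labels invented for illustration).
papers = [
    {"insurance", "risk modeling"},
    {"insurance", "risk modeling", "data breach"},
    {"malware", "intrusion detection"},
    {"insurance", "data breach"},
    {"malware", "intrusion detection", "data breach"},
]

frequent = frequent_itemsets(papers, min_support=0.4)
rules = association_rules(frequent)
confidence = {(a, b): c for a, b, c in rules}

for antecedent, consequent, conf in sorted(rules, key=lambda r: -r[2]):
    print(f"{antecedent} -> {consequent}: confidence {conf:.2f}")
```

In this toy setting, a rule with high confidence indicates that papers touching one cluster usually touch the other as well, which is the kind of cross-topic signal the TL;DR attributes to the association analysis.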

Paper Structure

This paper contains 27 sections, 19 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: NLP-powered literature system
  • Figure 2: The workflow of keyword extraction algorithm
  • Figure 3: The workflow of keyword clustering and association analysis between clusters
  • Figure 4: The workflow of semantic search
  • Figure 5: CyLit system structure illustration
  • ...and 3 more figures