Iterative NLP Query Refinement for Enhancing Domain-Specific Information Retrieval: A Case Study in Career Services

Elham Peimani, Gurpreet Singh, Nisarg Mahyavanshi, Aman Arora, Awais Shaikh

TL;DR

The paper addresses poor retrieval performance for niche-domain information needs by applying an iterative, semi-automated query refinement workflow to TF-IDF-based retrieval over Humber College's career services pages. Queries are expanded with domain-specific terms extracted from the top-ranked results and with structured descriptors, and automated scripts make the refinement scalable. On five benchmark queries, the approach raises the average top-document similarity from about 0.18 to 0.42, a statistically significant improvement under a paired t-test (t = -2.9444, p = 0.0422). The work provides a reproducible framework and codebase, and discusses extending the approach with neural retrieval models for further gains.

Abstract

Retrieving semantically relevant documents in niche domains poses significant challenges for traditional TF-IDF-based systems, often resulting in low similarity scores and suboptimal retrieval performance. This paper addresses these challenges by introducing an iterative and semi-automated query refinement methodology tailored to Humber College's career services webpages. Initially, generic queries related to interview preparation yield low top-document similarities (approximately 0.2 to 0.3). To enhance retrieval effectiveness, we implement a two-fold approach: first, domain-aware query refinement by incorporating specialized terms such as resources-online-learning, student-online-services, and career-advising; second, the integration of structured educational descriptors like "online resume and interview improvement tools." Additionally, we automate the extraction of domain-specific keywords from top-ranked documents to suggest relevant terms for query expansion. Through experiments conducted on five baseline queries, our semi-automated iterative refinement process elevates the average top similarity score from approximately 0.18 to 0.42, marking a substantial improvement in retrieval performance. The implementation details, including reproducible code and experimental setups, are made available in our GitHub repositories https://github.com/Elipei88/HumberChatbotBackend and https://github.com/Nisarg851/HumberChatbot. We also discuss the limitations of our approach and propose future directions, including the integration of advanced neural retrieval models.
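To make the refinement loop described in the abstract concrete, here is a minimal, dependency-free sketch: rank documents by TF-IDF cosine similarity, take the highest-weighted terms from the top-ranked documents that the query does not already contain, append them, and re-rank. It is an illustration under stated assumptions, not the code in the repositories above. All function names (`tokenize`, `build_idf`, `refine`, and so on), the small stopword list, the smoothed IDF formula, and the parameters `rounds`, `top_n`, and `k` are hypothetical choices made for this example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "on", "with"}


def tokenize(text):
    """Lowercase and split into words; hyphenated terms stay as one token."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_idf(docs_tokens):
    """Smoothed inverse document frequency for every corpus term."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, doc_vecs, idf):
    """Return (similarity, doc_index) pairs, best match first."""
    q = tfidf_vector(tokenize(query), idf)
    scores = [(cosine(q, d), i) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


def suggest_terms(query, top_vecs, k):
    """Pick the k highest-weighted terms in the top documents not yet in the query."""
    seen = set(tokenize(query))
    totals = Counter()
    for vec in top_vecs:
        for term, weight in vec.items():
            if term not in seen:
                totals[term] += weight
    ordered = sorted(totals.items(), key=lambda x: (-x[1], x[0]))
    return [term for term, _ in ordered[:k]]


def refine(query, docs, rounds=2, top_n=1, k=3):
    """Expand the query for up to `rounds` rounds.

    Returns a list of (query, top similarity) pairs, one per ranking pass.
    """
    if not docs:
        raise ValueError("docs must not be empty")
    docs_tokens = [tokenize(d) for d in docs]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(t, idf) for t in docs_tokens]

    history = []
    for _ in range(rounds + 1):
        ranked = rank(query, doc_vecs, idf)
        history.append((query, ranked[0][0]))
        top_vecs = [doc_vecs[i] for _, i in ranked[:top_n]]
        new_terms = suggest_terms(query, top_vecs, k)
        if not new_terms:
            break  # nothing left to add
        query = query + " " + " ".join(new_terms)
    return history
```

Calling `refine("interview tips", docs)` on a small list of page texts returns the query and its top similarity after each pass, which is the quantity the abstract reports before and after refinement. One design caveat applies to any loop of this shape: because the added terms come from the currently top-ranked documents, similarity to those documents tends to rise partly by construction, so a human check of the suggested terms (the "semi-automated" step) is what keeps the expanded query on topic.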

Paper Structure

This paper contains 17 sections, 2 equations, 1 figure, 1 table, 1 algorithm.

Figures (1)

  • Figure 1: Comparison of baseline and refined top similarities for five queries.