Semantic Search and Recommendation Algorithm

Aryan Duhan, Aryan Singhal, Shourya Sharma, Neeraj, Arti MK

TL;DR

This paper presents a semantic search framework that combines Word2Vec embeddings with the Annoy Index to enable fast, scalable, and context-aware retrieval over large datasets. By transforming queries and documents into dense vector representations and using approximate nearest-neighbor search, the approach achieves higher accuracy and significantly reduced latency compared to traditional keyword-based methods. Empirical results on datasets up to 100 GB demonstrate improved precision, recall, and F1 scores alongside efficient resource utilization, confirming the method’s suitability for real-time applications in domains like healthcare, e-commerce, and academia. The work highlights practical implications for large-scale information retrieval and outlines future enhancements, including robustness to data quality and exploration of alternative ANN techniques.

Abstract

This paper introduces a new semantic search algorithm that uses Word2Vec and the Annoy Index to improve the efficiency of information retrieval from large datasets. The proposed approach addresses the limitations of traditional keyword-based search by offering enhanced speed, accuracy, and scalability. Testing on datasets up to 100 GB demonstrates the method's effectiveness in processing vast amounts of data while maintaining high precision and performance.
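
The retrieval idea described above (embed the query and the documents as dense vectors, then return the nearest documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `TOY_VECTORS` table is a hypothetical stand-in for trained Word2Vec embeddings, the helper names `embed`, `cosine`, and `search` are invented for illustration, and an exact brute-force scan replaces the approximate nearest-neighbor lookup that an Annoy index would provide.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for trained Word2Vec
# embeddings (assumption: real embeddings would have far more dimensions).
TOY_VECTORS = {
    "heart": [0.90, 0.10, 0.00],
    "cardiac": [0.85, 0.15, 0.05],
    "disease": [0.70, 0.20, 0.10],
    "laptop": [0.00, 0.90, 0.20],
    "computer": [0.05, 0.85, 0.25],
    "price": [0.10, 0.30, 0.90],
}


def embed(text):
    """Average the vectors of all known words; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, k=2):
    """Return the k documents most similar to the query.

    Exact brute-force scan; an ANN index such as Annoy would replace this
    loop to keep latency low on large collections.
    """
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


DOCS = ["heart disease", "laptop price", "computer"]
top = search("cardiac", DOCS, k=1)
```

In this toy setup the query "cardiac" shares no keyword with any document, yet the top result is "heart disease" because their vectors point in nearly the same direction. That is the kind of match a purely keyword-based search would miss.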

Paper Structure

This paper contains 19 sections and 4 figures.

Figures (4)

  • Figure 1: The flow of execution for our system involves several key stages, from data collection to model training, and ultimately to the deployment of the best-performing model.
  • Figure 2: Comparison of search accuracy between the proposed method and traditional search methods across various dataset sizes.
  • Figure 3: Response times of the semantic search engine compared to traditional methods, illustrating enhanced efficiency.
  • Figure 4: Scalability of the semantic search engine showing consistent performance across data sizes and efficient resource utilization.