Semantic Search and Recommendation Algorithm
Aryan Duhan, Aryan Singhal, Shourya Sharma, Neeraj, Arti MK
TL;DR
This paper presents a semantic search framework that combines Word2Vec embeddings with the Annoy Index to enable fast, scalable, and context-aware retrieval over large datasets. Queries and documents are transformed into dense vector representations, and approximate nearest-neighbor (ANN) search retrieves the closest matches. The authors report higher accuracy and lower latency than traditional keyword-based methods. Empirical results on datasets of up to 100 GB show improved precision, recall, and F1 scores alongside efficient resource utilization, supporting the method’s suitability for real-time applications in domains such as healthcare, e-commerce, and academia. The work highlights practical implications for large-scale information retrieval and outlines future enhancements, including robustness to data quality and exploration of alternative ANN techniques.
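The embed-then-search idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional word vectors are hand-written placeholders standing in for a trained Word2Vec model, the exact cosine scan stands in for the Annoy Index, and the names `TOY_VECTORS`, `embed`, `cosine`, and `search` are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors. A trained Word2Vec model would
# supply these in practice, usually with a few hundred dimensions.
TOY_VECTORS = {
    "heart":    [0.9, 0.1, 0.0],
    "cardiac":  [0.8, 0.2, 0.1],
    "disease":  [0.7, 0.3, 0.0],
    "laptop":   [0.0, 0.9, 0.2],
    "price":    [0.1, 0.8, 0.3],
    "paper":    [0.1, 0.2, 0.9],
    "citation": [0.0, 0.1, 0.8],
}


def embed(text):
    """Average the vectors of the known words in text; return None if none are known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, k=2):
    """Return the indices of the k documents most similar to the query.

    This is an exact linear scan, used here only as a stand-in for an
    approximate nearest-neighbor index such as Annoy.
    """
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc_id, text in enumerate(documents):
        d = embed(text)
        if d is not None:
            scored.append((cosine(q, d), doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


DOCS = ["heart disease", "laptop price", "paper citation"]
top = search("cardiac", DOCS, k=1)
print(top)
```

The query word "cardiac" appears in none of the documents, so a keyword match would return nothing, while the vector comparison still ranks "heart disease" first. At scale, the linear scan would presumably be replaced by an index built once with Annoy's `AnnoyIndex` (`add_item`, `build`) and queried with `get_nns_by_vector`, trading a small amount of accuracy for much lower query time.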
Abstract
This paper introduces a semantic search algorithm that uses Word2Vec and the Annoy Index to improve the efficiency of information retrieval from large datasets. The proposed approach addresses the limitations of traditional search methods by offering greater speed, accuracy, and scalability. Testing on datasets of up to 100 GB demonstrates the method's effectiveness in processing large volumes of data while maintaining high precision and performance.
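For readers unfamiliar with the evaluation metrics named in this summary, the following sketch shows how precision, recall, and F1 are conventionally computed for a single query. The function name `precision_recall_f1` and the sample document IDs are illustrative assumptions, not values or code from the paper.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative IDs only: four documents returned, three truly relevant.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r, f1)
```

Precision measures how many of the returned documents are relevant, recall measures how many of the relevant documents were returned, and F1 is their harmonic mean, so it penalizes a system that is strong on one and weak on the other.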
