Using Word Embeddings for Automatic Query Expansion

Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, Utpal Garain

TL;DR

"The paper investigates using word2vec word embeddings for automatic query expansion in ad hoc retrieval. It proposes multiple kNN-based QE strategies and an extended query term set that leverage embedding space and compositionality, then evaluates against a strong baseline and RM3 across standard datasets. The results show that while embedding-based QE improves over raw-query retrieval, it generally underperforms RM3, and composition helps; the discussion highlights the complementary strengths of co-occurrence statistics and embedding similarity. The work suggests combining these approaches and exploring localized training to enhance practical effectiveness."

Abstract

This paper proposes a framework for Automatic Query Expansion (AQE) based on word2vec, a distributed neural language model. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Within this framework, we devise a query expansion technique in which terms related to a query are obtained by a K-nearest neighbor approach. We explore the performance of the AQE methods, with and without feedback query expansion, and of a variant of simple K-nearest neighbor in the proposed framework. Experiments on standard TREC ad hoc data (Disks 4 and 5 with query sets 301-450 and 601-700) and web data (WT10G with query set 451-550) show significant improvement over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to a statistical co-occurrence-based feedback method such as RM3. We also find that the word2vec-based query expansion methods perform similarly with and without feedback information.
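
To make the K-nearest neighbor idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small in-memory embedding matrix, and it ranks candidate terms by their mean cosine similarity to the query terms; that scoring rule, the function name `expand_query`, and the toy 2-dimensional vectors are all illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def expand_query(query_terms, vocab, embeddings, k=2):
    """Return the k vocabulary terms nearest to the query in embedding space.

    Hypothetical helper: candidates are scored by mean cosine similarity
    to the query terms, which is an assumption made for this sketch.
    """
    # Normalise each row so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    index = {term: i for i, term in enumerate(vocab)}
    # Query terms missing from the vocabulary are silently ignored.
    query_ids = [index[t] for t in query_terms if t in index]
    if not query_ids:
        return []
    # Mean similarity of every vocabulary term to the known query terms.
    scores = (unit @ unit[query_ids].T).mean(axis=1)
    # Never propose a term that is already part of the query.
    scores[query_ids] = -np.inf
    top = np.argsort(-scores)[:k]
    return [vocab[i] for i in top]


# Toy, hand-made vectors standing in for trained word2vec embeddings.
TOY_VOCAB = ["car", "automobile", "vehicle", "banana", "fruit"]
TOY_EMBEDDINGS = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
])

expansion_terms = expand_query(["car"], TOY_VOCAB, TOY_EMBEDDINGS, k=2)
```

In a real setting the matrix would come from a word2vec model trained on the target collection, and the returned terms would be appended to the original query before retrieval.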

Paper Structure

This paper contains 12 sections, 5 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Difference in AP for individual queries.