Information Retrieval with Entity Linking

Dahlia Shehata

TL;DR

This work investigates enhancing information retrieval by expanding both queries and documents with linked entities to mitigate vocabulary gaps and semantic misalignment. It demonstrates that explicit entity expansion substantially improves recall at the first retrieval stage for sparse BM25-based pipelines on MS MARCO, with complementary gains visible through run fusion and oracle analyses. However, when applied to a state-of-the-art dense retrieval pipeline (STAR-ADORE), entity augmentation yields only a neutral effect, suggesting domain- and method-dependent benefits. The study provides a rigorous comparison across multiple relevance judgments (Original, MonoT5, DuoT5) and highlights practical, scalable strategies for improving early-stage recall in cascaded IR systems, with clear directions for future work including more diverse dense models and expanded entity representations.

Abstract

Despite their low resource requirements, traditional sparse retrievers depend on exact matching between high-dimensional bag-of-words (BoW) representations of the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. Transformer-based dense retrievers, on the other hand, bring significant improvements to information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they are less efficient and generalize less well than sparse retrievers. For a lightweight retrieval task, the computational cost and time consumption of dense models are major barriers that discourage their use despite the potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, using two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5, and the latter set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in the cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work.
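
The expansion step described above can be illustrated with a minimal sketch. It assumes the entity linker has already returned a list of entity names for a query or passage; the function names (expand, hashed_token), the "ent" prefix, and the MD5-based hashing scheme are hypothetical choices made for illustration and are not taken from the paper. The explicit format appends the entity name as ordinary words, while the hashed format appends one opaque token per entity so that a multi-word name can only match as a unit.

```python
import hashlib


def hashed_token(entity_name):
    """Map an entity name to one opaque token (assumed scheme, not the paper's)."""
    digest = hashlib.md5(entity_name.lower().encode("utf-8")).hexdigest()
    # A letter prefix plus hex digits stays a single term under most tokenizers.
    return "ent" + digest[:12]


def expand(text, entity_names, mode="explicit"):
    """Append linked entities to a query or passage before sparse indexing.

    entity_names is the (hypothetical) output of an entity linker for `text`.
    """
    if mode == "explicit":
        # Explicit format: the entity name's words are added as normal terms.
        extra = [name.lower() for name in entity_names]
    elif mode == "hashed":
        # Hashed format: each entity becomes one token shared by queries and documents.
        extra = [hashed_token(name) for name in entity_names]
    else:
        raise ValueError("mode must be 'explicit' or 'hashed'")
    return " ".join([text] + extra)


query = expand("who founded apple", ["Apple Inc."], mode="explicit")
passage = expand("The company was started in a garage.", ["Apple Inc."], mode="hashed")
```

Because the same function is applied to queries and passages, an entity mentioned with different surface forms on each side still contributes a shared term that an exact-match retriever such as BM25 can score.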

Paper Structure

This paper contains 69 sections, 3 equations, 14 figures, and 16 tables.

Figures (14)

  • Figure 2: Dense Retriever pipeline using STAR and ADORE (adapted from the GitHub documentation of the model implementation, https://github.com/jingtaozhan/DRhard)
  • Figure 3: Overview of the Classifier Architecture (adapted from arabzadeh2021predicting)
  • Figure 4: Recall curves of the Dev query set with respect to the original qrels. The x-axis shows the cutoffs, and the y-axis is the corresponding recall value.
  • Figure 5: Recall curves of the Hard query set with respect to the original qrels. The x-axis shows the cutoffs, and the y-axis is the corresponding recall value.
  • Figure 6: Recall curves of the Harder query set with respect to the original qrels. The x-axis shows the cutoffs, and the y-axis is the corresponding recall value.
  • ...and 9 more figures
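
The recall curves in the figures above plot recall against the retrieval cutoff. As a minimal sketch of how such a curve could be computed, the code below averages recall@k over queries for several cutoffs. The data layout (a run mapping each query id to a ranked list of passage ids, and qrels mapping each query id to a set of relevant passage ids) and the function names are assumptions made for illustration; the toy ids are not drawn from MS MARCO.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & relevant) / len(relevant)


def mean_recall_curve(run, qrels, cutoffs):
    """Mean recall@k over all judged queries, for each cutoff k."""
    curve = {}
    for k in cutoffs:
        scores = [
            recall_at_k(run.get(query_id, []), relevant, k)
            for query_id, relevant in qrels.items()
            if relevant  # skip queries without any relevant judgment
        ]
        curve[k] = sum(scores) / len(scores) if scores else 0.0
    return curve


# Toy example with made-up ids.
run = {"q1": ["d3", "d1", "d7"], "q2": ["d9", "d2"]}
qrels = {"q1": {"d1"}, "q2": {"d2", "d5"}}
curve = mean_recall_curve(run, qrels, cutoffs=[1, 2, 1000])
# curve == {1: 0.0, 2: 0.75, 1000: 0.75}
```

Swapping in a different qrel set (for example the original judgments versus a re-ranked set) changes only the qrels argument, so the same run can be scored against each set of judgments.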