Information Retrieval with Entity Linking
Dahlia Shehata
TL;DR
This work investigates enhancing information retrieval by expanding both queries and documents with linked entities to mitigate vocabulary gaps and semantic misalignment. It demonstrates that explicit entity expansion substantially improves first-stage recall for sparse BM25-based pipelines on MS MARCO, with complementary gains visible in run fusion and oracle analyses. However, when applied to a state-of-the-art dense retrieval pipeline (STAR-ADORE), entity augmentation has only a neutral effect, suggesting that the benefits are domain- and method-dependent. The study compares results across multiple sets of relevance judgments (Original, MonoT5, DuoT5) and highlights practical, scalable strategies for improving early-stage recall in cascaded IR systems. Directions for future work include more diverse dense models and expanded entity representations.
Abstract
Traditional sparse retrievers have low resource requirements, but they depend on exact matching between high-dimensional bag-of-words (BoW) representations of the queries and the collection. As a result, their retrieval performance is limited by semantic discrepancies and vocabulary gaps. Transformer-based dense retrievers, on the other hand, bring significant improvements to information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. Although dense retrievers are known for their relative effectiveness, they are less efficient and generalize less well than sparse retrievers. For a lightweight retrieval task, their high computational cost and time consumption are major barriers that discourage the use of dense models despite the potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, with the entity names represented in two formats: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using three sets of relevance judgments: the original qrels, the qrels re-ranked by MonoT5, and the MonoT5 set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in the cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work.
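To make the two expansion formats and the evaluation metric concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the entity linker has already returned a list of entity names for each query or passage. The function names `expand_text` and `recall_at_k`, the MD5-based hashing, and the `ent` token prefix are all hypothetical choices made for illustration; the abstract does not specify how entity names are hashed.

```python
import hashlib
from typing import Iterable, List, Set


def expand_text(text: str, entities: Iterable[str], mode: str = "explicit") -> str:
    """Append linked entity names to a query or passage.

    mode="explicit": append each entity name as ordinary words.
    mode="hashed":   append one opaque token per entity, so that a
                     multi-word name matches only as a whole.
    The hashing scheme (MD5, 12 hex characters) is an assumption.
    """
    extra: List[str] = []
    for name in entities:
        if mode == "explicit":
            extra.append(name)
        elif mode == "hashed":
            digest = hashlib.md5(name.lower().encode("utf-8")).hexdigest()
            extra.append("ent" + digest[:12])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return " ".join([text, *extra]) if extra else text


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 1000) -> float:
    """Fraction of the relevant passages that appear in the top-k ranking."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    query = "who founded apple"
    print(expand_text(query, ["Apple Inc."], mode="explicit"))
    print(expand_text(query, ["Apple Inc."], mode="hashed"))
    print(recall_at_k(["d1", "d2", "d3"], {"d2", "d9"}, k=2))
```

In this sketch the same entity name always maps to the same hashed token, so a query and a passage that link to the same entity share an exact-match term that a BoW retriever such as BM25 can score. The expanded texts would then be indexed and searched in place of the originals, with recall@1000 computed per query against the chosen qrel set.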
