Information Retrieval: Recent Advances and Beyond
Kailash A. Hambarde, Hugo Proenca
TL;DR
The paper surveys information retrieval (IR) methods across two stages—initial retrieval and subsequent ranking—emphasizing the shift from traditional term-based matching to semantic and neural approaches enabled by large datasets and compute. It categorizes retrieval into sparse, dense, and hybrid methods, and discusses both first-stage retrieval and second-stage ranking, including pre-training objectives and expansion techniques. Key contributions include a taxonomy of retrieval methods, a synthesis of historical and modern techniques, datasets, and identified challenges such as long-tail and multilingual queries, with guidance for researchers and practitioners. The work highlights practical trade-offs and future directions to build scalable, accurate IR systems across diverse tasks and domains.
Abstract
In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including term-based, semantic retrieval, and neural methods. Additionally, we delve into the key topics related to the learning process of these models. In this way, the survey offers a comprehensive understanding of the field and is of interest to researchers and practitioners entering or working in the information retrieval domain.
