Multi-Field Adaptive Retrieval
Millicent Li, Tongfei Chen, Benjamin Van Durme, Patrick Xia
TL;DR
This work introduces Multi-Field Adaptive Retrieval (mFAR), a framework for semi-structured document retrieval that decomposes documents into named fields and scores each field with multiple scorers (e.g., lexical and dense). It learns a query-conditioned weighting function $G(q,f,m)$ that adaptively combines the per-field, per-scorer scores, so that the document score $s(q,d)$ is a weighted sum over fields $f$ and scorers $m$; scores are normalized to stabilize learning. Empirically, mFAR achieves state-of-the-art results on the STaRK benchmark, with significant gains from hybrid scoring and from true multi-field representations, while remaining flexible at test time and avoiding heavy pretraining. The results also show that query-conditioned adaptation is essential and that the optimal balance of fields and scorers varies by domain, which illustrates the practical value of adaptable, structured retrieval for retrieval-augmented generation (RAG) pipelines. The work advances retrieval for semi-structured data and lays groundwork for broader multi-field, multi-scorer integration in knowledge-grounded AI systems.
Abstract
Document retrieval for tasks such as search and retrieval-augmented generation typically involves datasets that are unstructured: free-form text without explicit internal structure in each document. However, documents can have a structured form, consisting of fields such as an article title, message body, or HTML header. To address this gap, we introduce Multi-Field Adaptive Retrieval (mFAR), a flexible framework that accommodates any number and any type of document indices on structured data. Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query, allowing on-the-fly weighting of the most likely field(s). We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
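
The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy scorers, the softmax normalization of the weights, and the names `lexical_score`, `bigram_cosine`, `field_weights`, `score`, and `toy_logits` are all assumptions made for illustration. A real system would use a lexical index and a dense encoder per field, and would learn the weighting function from data.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Document = Dict[str, str]  # field name -> field text


def lexical_score(query: str, text: str) -> float:
    """Toy lexical scorer: fraction of query tokens found in the field."""
    q_tokens = query.lower().split()
    t_tokens = set(text.lower().split())
    if not q_tokens:
        return 0.0
    return sum(tok in t_tokens for tok in q_tokens) / len(q_tokens)


def bigram_cosine(query: str, text: str) -> float:
    """Toy stand-in for a dense scorer: cosine of character-bigram counts."""
    def bigrams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    a, b = bigrams(query), bigrams(text)
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Scorers m applied to every field; both toy scorers return values in [0, 1].
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "lexical": lexical_score,
    "dense": bigram_cosine,
}


def field_weights(
    query: str,
    fields: List[str],
    logits_fn: Callable[[str, str, str], float],
) -> Dict[Tuple[str, str], float]:
    """G(q, f, m): query-conditioned weights over (field, scorer) pairs.

    Assumption: weights are a softmax over logits, so they sum to 1.
    """
    keys = [(f, m) for f in fields for m in SCORERS]
    logits = [logits_fn(query, f, m) for f, m in keys]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {k: e / total for k, e in zip(keys, exps)}


def score(query: str, doc: Document,
          logits_fn: Callable[[str, str, str], float]) -> float:
    """s(q, d): weighted sum of per-field, per-scorer scores."""
    weights = field_weights(query, list(doc), logits_fn)
    return sum(w * SCORERS[m](query, doc[f]) for (f, m), w in weights.items())


def toy_logits(query: str, field: str, scorer: str) -> float:
    """Hypothetical stand-in for a learned model: boost a field that the
    query names explicitly. The scorer argument is ignored here."""
    return 2.0 if field in query.lower() else 0.0


doc_a = {"title": "adaptive retrieval", "body": "unrelated notes on cooking"}
doc_b = {"title": "cooking notes", "body": "adaptive retrieval"}
query = "title adaptive retrieval"

# The query names the title field, so a title match outranks a body match.
ranked = sorted([doc_a, doc_b],
                key=lambda d: score(query, d, toy_logits), reverse=True)
```

Because the weights depend on the query, the same pair of documents can be ranked differently for a query that points at a different field, which is the behavior a fixed per-field weighting cannot reproduce.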
