Multi-Field Adaptive Retrieval

Millicent Li, Tongfei Chen, Benjamin Van Durme, Patrick Xia

TL;DR

This work introduces Multi-Field Adaptive Retrieval (mFAR), a framework for semi-structured document retrieval that decomposes documents into named fields and scores each field with multiple scorers (e.g., lexical and dense). It learns a query-conditioned weighting function $G(q,f,m)$, where $q$ is the query, $f$ a field, and $m$ a scorer, and uses it to combine the per-field, per-scorer scores into a document score $s(q,d)$, computed as a weighted sum over fields and scorers. Scores are normalized to stabilize learning. Empirically, mFAR achieves state-of-the-art results on the STaRK benchmark, with significant gains from hybrid scoring and from true multi-field representations, while remaining flexible at test time and avoiding heavy pretraining. The results also show that query-conditioned adaptation is essential and that the optimal balance of fields and scorers varies by domain, which illustrates the practical value of adaptable, structured retrieval for retrieval-augmented generation (RAG) pipelines and lays groundwork for broader multi-field, multi-scorer integration in knowledge-grounded AI systems.
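The weighted sum described above can be written out as follows. This is a sketch based only on the summary's description; the set symbols $\mathcal{F}$ (fields) and $\mathcal{M}$ (scorers) and the per-field score notation $s_f^m(q,d)$ are assumed here for illustration and may differ from the paper's own notation.

$$
s(q,d) \;=\; \sum_{f \in \mathcal{F}} \; \sum_{m \in \mathcal{M}} G(q,f,m)\, s_f^m(q,d)
$$

Here $s_f^m(q,d)$ denotes the (normalized) score that scorer $m$ assigns to field $f$ of document $d$ for query $q$, and $G(q,f,m)$ is the learned, query-conditioned weight for that field and scorer pair.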

Abstract

Document retrieval for tasks such as search and retrieval-augmented generation typically involves datasets that are unstructured: free-form text without explicit internal structure in each document. However, documents can have a structured form, consisting of fields such as an article title, message body, or HTML header. To address this gap, we introduce Multi-Field Adaptive Retrieval (mFAR), a flexible framework that accommodates any number and any type of document indices on structured data. Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query, allowing on-the-fly weighting of the most likely field(s). We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
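The two steps in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. The scorers are toy stand-ins (token overlap for the lexical scorer, character-trigram cosine in place of a dense encoder), and `toy_gate` is a hypothetical hand-set placeholder for the learned, query-conditioned weighting. The softmax normalization of the weights is also an assumption made for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

Scorer = Callable[[str, str], float]


def lexical_score(query: str, text: str) -> float:
    """Toy lexical scorer: fraction of query tokens found in the field text."""
    q_tokens = query.lower().split()
    t_tokens = set(text.lower().split())
    if not q_tokens:
        return 0.0
    return sum(tok in t_tokens for tok in q_tokens) / len(q_tokens)


def trigram_score(query: str, text: str) -> float:
    """Toy stand-in for a dense scorer: cosine over character-trigram counts."""
    def grams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    a, b = grams(query), grams(text)
    dot = sum(a[g] * b[g] for g in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical scorer registry: one lexical and one "dense" scorer per field.
SCORERS: Dict[str, Scorer] = {"lexical": lexical_score, "dense": trigram_score}


def toy_gate(
    query: str, fields: Iterable[str], scorers: Iterable[str]
) -> Dict[Tuple[str, str], float]:
    """Placeholder for the learned weighting over (field, scorer) pairs.

    Assumption: a field gets a higher logit if its name appears in the query,
    and the logits are softmax-normalized. A real model would learn this.
    """
    q = query.lower()
    scorers = list(scorers)
    logits = {(f, m): (1.0 if f in q else 0.0) for f in fields for m in scorers}
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}


def score_document(query: str, doc: Dict[str, str]) -> float:
    """Step 1: score each field with each scorer. Step 2: weighted sum."""
    weights = toy_gate(query, doc.keys(), SCORERS.keys())
    return sum(
        weights[(f, m)] * SCORERS[m](query, text)
        for f, text in doc.items()
        for m in SCORERS
    )
```

A document is represented here as a dictionary from field name to field text, for example `score_document("title about adaptive retrieval", {"title": "adaptive retrieval", "abstract": "we study fields"})`. In a real system each field would have its own lexical and dense index, and the gate would be trained jointly with the ranking objective.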

Paper Structure

This paper contains 38 sections, 4 equations, 3 figures, 16 tables.

Figures (3)

  • Figure 1: Traditional documents for retrieval (top), like those in MS MARCO (Nguyen et al., 2016) and BioASQ (Nentidis et al., 2023), are unstructured: free-form text that tends to directly answer the queries. Documents in the STaRK datasets (bottom) (Wu et al., 2024) are semi-structured: each contains multiple fields. The queries require information from some of these fields, so it is important to aggregate evidence across multiple fields while ignoring irrelevant ones.
  • Figure 2: Document D and query Q are examples from the STaRK-MAG dataset. Parts of the query (highlighted) correspond with specific fields from D. Traditional retrievers (A) would score the entire document against the query (e.g. through vector similarity). In (B), our method, mFAR, first decomposes D into fields and scores each field separately against the query using both lexical- and vector-based scorers. This yields a pair of field-specific similarity scores, which are combined using our adaptive query conditioning approach to produce a document-level similarity score.
  • Figure 3: Snippets from the highest-scoring document selected by various mFAR configurations. Top: a single-field hybrid model (mFAR$_2$) vs. mFAR$_{\text{All}}$. mFAR$_{\text{All}}$ picks correctly while mFAR$_2$ is possibly confused by negation in the query. Bottom: Snippets from configurations of mFAR with access to different scorers. Only mFAR$_{\text{All}}$ correctly makes use of both lexical and semantic matching across fields.