Numbers Matter! Bringing Quantity-awareness to Retrieval Systems
Satya Almasian, Milena Bruseva, Michael Gertz
TL;DR
This work addresses the challenge of incorporating quantitative semantics into retrieval by proposing two quantity-aware ranking paradigms: a disjoint approach that uses a separate quantity index to re-rank results and a joint approach that fine-tunes neural IR models to integrate textual and numerical reasoning. Central to both is the Comprehensive Quantity Extractor (CQE) for extracting values, units, conditions, and concepts, along with a concept/unit index to connect numbers to meaning. The authors introduce FinQuant and MedQuant, two benchmark datasets for quantity-centric ranking, and demonstrate that disjoint rankers generally outperform joint models on these tasks, with significant gains in P@10, MRR@10, and NDCG@10 and only modest latency overhead. They also show that models trained on synthetic quantity-centered data can generalize across domains and that semantic and lexical queries benefit from quantity-aware reranking, suggesting practical impact for improving retrieval where numerical constraints matter.
Abstract
Quantitative information plays a crucial role in understanding and interpreting the content of documents. Many user queries contain quantities and cannot be resolved without understanding their semantics, e.g., "car that costs less than $10k". Yet, modern search engines apply the same ranking mechanisms to both words and quantities, overlooking magnitude and unit information. In this paper, we introduce two quantity-aware ranking techniques designed to rank the quantity and textual content either jointly or independently. These techniques incorporate quantity information into existing retrieval systems and can address queries with the numerical conditions equal to, greater than, and less than. To evaluate the effectiveness of our proposed models, we introduce two novel quantity-aware benchmark datasets in the domains of finance and medicine and compare our method against various lexical and neural models. The code and data are available at https://github.com/satya77/QuantityAwareRankers.
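
To make the idea of ranking quantities independently of text more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a text ranker has already produced a score per document and that quantities have already been extracted into a separate index. The names `Quantity`, `QuantityCondition`, `satisfies`, `rerank`, and the additive `boost` parameter are all hypothetical and chosen only for this example.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Quantity:
    """A value with a normalized unit, as an extractor might produce (assumed shape)."""
    value: float
    unit: str


@dataclass(frozen=True)
class QuantityCondition:
    """A numerical query condition: one of '=', '<', '>' applied to a value and unit."""
    op: str
    value: float
    unit: str


# The three condition types named in the abstract.
_OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def satisfies(quantity: Quantity, cond: QuantityCondition) -> bool:
    """True if the quantity has the same unit and meets the numerical condition."""
    return quantity.unit == cond.unit and _OPS[cond.op](quantity.value, cond.value)


def rerank(
    text_scores: Dict[str, float],
    quantity_index: Dict[str, List[Quantity]],
    cond: QuantityCondition,
    boost: float = 1.0,
) -> List[str]:
    """Re-rank documents by text score plus a fixed boost for a satisfied condition.

    The additive boost is an assumption of this sketch, not a scoring
    function taken from the paper.
    """
    def score(doc_id: str) -> float:
        hit = any(satisfies(q, cond) for q in quantity_index.get(doc_id, []))
        return text_scores[doc_id] + (boost if hit else 0.0)

    return sorted(text_scores, key=score, reverse=True)


# Example: "car that costs less than $10k"
text_scores = {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.5}
quantity_index = {
    "doc_a": [Quantity(15000.0, "usd")],
    "doc_b": [Quantity(9500.0, "usd")],
    "doc_c": [Quantity(8000.0, "eur")],
}
condition = QuantityCondition("<", 10000.0, "usd")
ranking = rerank(text_scores, quantity_index, condition)
```

In this toy setup, a document whose text score is slightly lower but whose extracted price satisfies the condition moves ahead of a textually stronger document that violates it, while a quantity in a different unit earns no boost.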
