Reasoning-enhanced Query Understanding through Decomposition and Interpretation
Yunfei Zhong, Jun Yang, Yixing Fan, Lixin Su, Maarten de Rijke, Ruqing Zhang, Xueqi Cheng
TL;DR
This work tackles the challenge of complex, multi-faceted user queries in information retrieval by introducing ReDI, a reasoning-enhanced query understanding (QU) pipeline that decomposes queries into sub-queries, enriches each sub-query with an interpretation, and fuses the results of independent retrievals. By training on the real-world Coin dataset and distilling to a lightweight model, ReDI achieves consistent gains over strong baselines on BRIGHT and BEIR for both sparse and dense retrieval, while maintaining practical efficiency. The combination of explicit decomposition and interpretation, along with a flexible fusion strategy, improves coverage of user intent and yields more precise document matching, and it transfers robustly to long documents and out-of-domain data. The work also provides ablations and analyses of hyperparameters, fine-tuning paradigms, and retriever interactions, and releases code to encourage real-world adoption and further research in reasoning-based QU.
Abstract
Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results for the final ranking. We compile a large-scale dataset of real-world complex queries from a major search engine and distill the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms, affirming its effectiveness. We release our code at https://anonymous.4open.science/r/ReDI-6FC7/.
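The three-stage pipeline described above can be illustrated with a minimal Python sketch. It is not the released ReDI implementation: the sub-queries and interpretations are hard-coded where ReDI would generate them with an LLM, `toy_retrieve` is a hypothetical token-overlap scorer standing in for a sparse or dense retriever, and reciprocal rank fusion (RRF) is assumed as the fusion step only because the abstract does not name the strategy. All function names and the toy corpus are invented for illustration.

```python
from collections import defaultdict


def toy_retrieve(query, corpus, k=3):
    """Rank documents by the number of shared lowercase tokens.

    Hypothetical stand-in for a real sparse (e.g. BM25) or dense retriever.
    """
    q_tokens = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(q_tokens & set(text.lower().split()))
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in scored[:k]]


def rrf_fuse(rankings, rrf_k=60):
    """Reciprocal rank fusion (an assumed choice, not taken from the paper)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


def redi_style_search(units, corpus, k=3):
    """Retrieve independently per (sub-query, interpretation) pair, then fuse."""
    rankings = [
        toy_retrieve(f"{sub_query} {interpretation}", corpus, k)
        for sub_query, interpretation in units
    ]
    return rrf_fuse(rankings)


# Toy corpus and a hand-written decomposition of a complex query such as
# "why are my potted tomato leaves turning yellow?".
corpus = {
    "d1": "soil acidity affects tomato leaf color",
    "d2": "overwatering causes yellow leaves in potted plants",
    "d3": "history of tomato cultivation in europe",
}
units = [
    ("tomato leaves yellow", "nutrient deficiency soil acidity"),
    ("overwatering symptoms", "yellow leaves potted plants"),
]
fused = redi_style_search(units, corpus)
print(fused)
```

In this toy run, a document retrieved by both sub-queries accumulates two reciprocal-rank contributions and therefore ranks above documents found by only one, which is the kind of intent coverage the fusion stage is meant to reward.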
