AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation
Yilong Lai, Jialong Wu, Zhenglin Wang, Deyu Zhou
TL;DR
This paper tackles the gap in prompting-based conversational query reformulation by introducing AdaRewriter, a lightweight reward-model framework that enables test-time adaptation under the Best-of-N paradigm. AdaRewriter trains a compact reward model with a contrastive ranking objective to score reformulation candidates generated by an LLM, selecting the top reformulation for retrieval in both sparse and dense settings, including black-box LLM APIs. Across multiple conversational search benchmarks (TopiOCQA, QReCC, and zero-shot CAsT), AdaRewriter yields consistent improvements over training-time tuning baselines and state-of-the-art prompting methods, especially as the candidate pool grows. The results demonstrate that test-time adaptation can unlock substantial gains in conversational query reformulation, with practical impact for real-world systems that rely on external LLMs and heterogeneous retrievers.
Abstract
Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over candidates generated via prompting shows promising scaling potential. However, neither previous training-time tuning methods nor test-time adaptation approaches fully unleash its benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation that uses an outcome-supervised reward model for test-time adaptation. By training a lightweight reward model with a contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
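The two ideas named in the abstract, Best-of-N selection and a ranking loss for the reward model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `best_of_n` and `pairwise_ranking_loss` are hypothetical, the scorer is a placeholder for a trained reward model, and the loss shown is a generic pairwise logistic ranking loss, which may differ from the exact contrastive objective used in the paper.

```python
import math
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scorer rates highest.

    `score` stands in for a trained reward model (assumption).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def pairwise_ranking_loss(scores: Sequence[float], ranks: Sequence[int]) -> float:
    """Mean logistic loss over all pairs (i, j) where i outranks j.

    `ranks` holds the retrieval-outcome rank of each candidate
    (lower is better); `scores` holds the reward-model outputs.
    The loss is small when better-ranked candidates score higher.
    """
    if len(scores) != len(ranks):
        raise ValueError("scores and ranks must have the same length")
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ranks[i] < ranks[j]:
                # log(1 + exp(s_j - s_i)): penalizes s_j exceeding s_i
                total += math.log1p(math.exp(scores[j] - scores[i]))
                pairs += 1
    return total / pairs if pairs else 0.0
```

At inference time only `best_of_n` is needed, so the LLM that produces the candidates can remain a black box; the ranking loss is relevant only when training the scorer.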
