Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval

Debayan Mukhopadhyay, Utshab Kumar Ghosh, Shubham Chatterjee

TL;DR

This work reframes Tip-of-the-Tongue (ToT) retrieval as an agentic, memory-reconstruction task and introduces a lightweight, zero-shot LLM-based query reformulation step that precedes a four-stage hybrid retrieval pipeline. Rewriting ToT queries with off-the-shelf LLMs raises first-stage recall and lets the downstream bi-encoder, cross-encoder, and LLM-based listwise re-rankers achieve state-of-the-art results on the TREC-ToT 2025 benchmark. Compared to raw queries, rewriting yields a $Recall@1000$ uplift of 20.61%, and subsequent re-ranking yields gains in $nDCG@10$, $MRR$, and $MAP@10$ of 33.88%, 29.92%, and 29.98%, respectively, all without fine-tuning or domain adaptation. The results indicate that pre-retrieval cognitive reconstruction, combined with careful stage-wise ranking and efficient decoding, offers a practical, corpus-agnostic path to robust ToT retrieval in open-world settings. The work treats query interpretation as a first-class component of retrieval and shows how a staged cascade can balance effectiveness against computational cost.

Abstract

Retrieving known items from vague descriptions, Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge. We propose using a single call to a generic 8B-parameter LLM for query reformulation, bridging the gap between ill-formed ToT queries and specific information needs. This method is particularly effective where standard Pseudo-Relevance Feedback fails due to poor initial recall. Crucially, our LLM is not fine-tuned for ToT or specific domains, demonstrating that gains stem from our prompting strategy rather than model specialization. Rewritten queries feed a multi-stage pipeline: sparse retrieval (BM25), dense/late-interaction reranking (Contriever, E5-large-v2, ColBERTv2), monoT5 cross-encoding, and list-wise reranking (Qwen 2.5 72B). Experiments on 2025 TREC-ToT datasets show that while raw queries yield poor performance, our lightweight pre-retrieval transformation improves Recall by 20.61%. Subsequent reranking improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98%, offering a cost-effective intervention that unlocks the potential of downstream rankers. Code and data: https://github.com/debayan1405/TREC-TOT-2025
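The metrics named above take a simple form in known-item retrieval. The following is a minimal sketch, assuming each query has exactly one relevant item with binary relevance (the function name, the cutoff defaults, and the example ranks are illustrative and not taken from the paper's code). Under that assumption the ideal DCG is 1, so nDCG@10 reduces to a single discounted gain, and average precision reduces to the reciprocal rank within the cutoff.

```python
import math
from typing import Dict, Optional


def known_item_metrics(rank: Optional[int],
                       recall_k: int = 1000,
                       cutoff: int = 10) -> Dict[str, float]:
    """Per-query metrics when exactly one item is relevant.

    rank is the 1-based position of the relevant item in the ranked list,
    or None if the item was not retrieved at all.
    """
    if rank is None:
        return {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0, "map": 0.0}
    in_cutoff = rank <= cutoff
    return {
        # Recall@k: 1 if the single relevant item appears in the top k.
        "recall": 1.0 if rank <= recall_k else 0.0,
        # Reciprocal rank of the relevant item.
        "mrr": 1.0 / rank,
        # nDCG@cutoff: the ideal DCG is 1, so only the discount remains.
        "ndcg": 1.0 / math.log2(rank + 1) if in_cutoff else 0.0,
        # AP@cutoff with one relevant item is the precision at its rank.
        "map": 1.0 / rank if in_cutoff else 0.0,
    }


# Hypothetical ranks for two queries: found at rank 3, and not retrieved.
at_rank_3 = known_item_metrics(3)
missed = known_item_metrics(None)
```

Averaging these per-query values over a query set gives the reported figures. A query whose target never enters the first-stage candidate list scores zero on every metric, which is why first-stage recall bounds what the later re-rankers can achieve.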

Paper Structure (70 sections, 3 equations, 1 figure, 10 tables)

Figures (1)

  • Figure 1: The proposed retrieval pipeline with four distinct stages: the Sparse Retrieval Stage (Stage 1), the Dense Retrieval Bi-encoder Stage (Stage 2), the Cross-Encoder Stage (Stage 3), and the LLM Re-ranker Stage (Stage 4). The re-written query set shown in the diagram was generated using Mistral-7B-Instruct-v0.3, as discussed in the paper's results analysis section.
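
The control flow of a rewrite-then-cascade pipeline like the one in Figure 1 can be sketched as below. This is an illustration only, not the authors' implementation: the token-overlap scorer stands in for BM25, the bi-encoders, and monoT5; the `rewrite` stub stands in for the single LLM call; and the `keep` depths are hypothetical. Each stage takes the whole candidate list, so a listwise re-ranker fits the same interface as a pointwise one.

```python
from typing import Callable, Dict, List

# A stage maps (query, candidate ids, corpus) to a re-ordered, truncated list.
Stage = Callable[[str, List[str], Dict[str, str]], List[str]]


def pointwise(scorer: Callable[[str, str], float], keep: int) -> Stage:
    """Wrap a per-document scorer as a stage that keeps the top `keep` ids."""
    def run(query: str, cands: List[str], corpus: Dict[str, str]) -> List[str]:
        ranked = sorted(cands, key=lambda d: scorer(query, corpus[d]), reverse=True)
        return ranked[:keep]
    return run


def cascade(query: str, corpus: Dict[str, str],
            rewrite: Callable[[str], str], stages: List[Stage]) -> List[str]:
    """Rewrite the query once, then pass a shrinking candidate list through each stage."""
    rewritten = rewrite(query)
    candidates = list(corpus)
    for stage in stages:
        candidates = stage(rewritten, candidates, corpus)
    return candidates


def overlap(query: str, text: str) -> float:
    """Toy scorer (placeholder for a real ranker): number of shared tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


corpus = {
    "d1": "a film about a shark terrorizing a beach town",
    "d2": "a documentary about coral reefs",
    "d3": "a romantic comedy set in paris",
}
# Stub rewrite: a real system would call an LLM here.
rewrite = lambda q: "film shark beach"
# Hypothetical depths; each later stage sees fewer candidates.
stages = [pointwise(overlap, keep=3), pointwise(overlap, keep=2), pointwise(overlap, keep=1)]
result = cascade("that movie with the big fish attacking swimmers", corpus, rewrite, stages)
```

Shrinking the candidate list at each stage is what lets the most expensive ranker run on only a handful of documents. In this toy example the raw query shares no tokens with the target document, while the rewritten one does.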