Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval
Debayan Mukhopadhyay, Utshab Kumar Ghosh, Shubham Chatterjee
TL;DR
This work reframes Tip-of-the-Tongue (ToT) retrieval as an agentic memory-reconstruction task and introduces a lightweight, zero-shot LLM-based query reformulation step that precedes a four-stage hybrid retrieval pipeline. Rewriting ToT queries with off-the-shelf LLMs raises first-stage recall, which in turn lets the downstream bi-encoder, cross-encoder, and LLM-based listwise re-rankers achieve state-of-the-art results on the TREC-ToT 2025 benchmark. Key findings include a $Recall@1000$ uplift of 20.61% from rewriting, and gains in $nDCG@10$, $MRR$, and $MAP@10$ of 33.88%, 29.92%, and 29.98%, respectively, compared to raw queries, all without fine-tuning or domain adaptation. The results indicate that pre-retrieval cognitive reconstruction, combined with careful stage-wise ranking and efficient decoding, offers a practical, corpus-agnostic path to robust ToT retrieval in open-world settings. The work treats query interpretation as a first-class component of retrieval and shows how a staged cascade can balance effectiveness against computational cost.
Abstract
Retrieving known items from vague descriptions, known as Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge. We propose using a single call to a generic 8B-parameter LLM for query reformulation, bridging the gap between ill-formed ToT queries and specific information needs. This method is particularly effective where standard Pseudo-Relevance Feedback fails due to poor initial recall. Crucially, our LLM is not fine-tuned for ToT or for specific domains, demonstrating that the gains stem from our prompting strategy rather than from model specialization. Rewritten queries feed a multi-stage pipeline: sparse retrieval (BM25), dense/late-interaction reranking (Contriever, E5-large-v2, ColBERTv2), monoT5 cross-encoding, and list-wise reranking (Qwen 2.5 72B). Experiments on the TREC-ToT 2025 datasets show that while raw queries yield poor performance, our lightweight pre-retrieval transformation improves Recall by 20.61%. Subsequent reranking improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98%, offering a cost-effective intervention that unlocks the potential of downstream rankers. Code and data: https://github.com/debayan1405/TREC-TOT-2025
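
The control flow of the rewrite-then-cascade design described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rerank`, `cascade`, `toy_rewrite`, and `overlap` are hypothetical, the filler-word filter only stands in for the single LLM reformulation call, and a token-overlap scorer stands in for BM25, the dense rerankers, monoT5, and the list-wise LLM. The cut-off depths are arbitrary.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A scorer maps (query, document text) to a relevance score.
Scorer = Callable[[str, str], float]


def rerank(query: str, doc_ids: Sequence[str], corpus: Dict[str, str],
           scorer: Scorer, keep: int) -> List[str]:
    """Sort candidates by descending score and keep the top `keep` ids."""
    ranked = sorted(doc_ids, key=lambda d: scorer(query, corpus[d]), reverse=True)
    return ranked[:keep]


def cascade(query: str, corpus: Dict[str, str],
            rewrite: Callable[[str], str],
            stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Rewrite the query once, then pass a shrinking candidate list through each stage."""
    rewritten = rewrite(query)
    candidates = list(corpus)
    for scorer, keep in stages:
        candidates = rerank(rewritten, candidates, corpus, scorer, keep)
    return candidates


# Hypothetical stand-in for the LLM rewrite: drop conversational filler words.
FILLER = {"um", "i", "think", "there", "was", "maybe", "stuff"}


def toy_rewrite(query: str) -> str:
    return " ".join(w for w in query.lower().split() if w not in FILLER)


# Hypothetical stand-in for every ranking stage: count shared tokens.
def overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.lower().split())))


corpus = {
    "d1": "a heist film about dream infiltration with a spinning top",
    "d2": "a documentary about deep sea fish",
    "d3": "a romantic comedy set in paris",
}

# Two toy stages: a cheap wide pass (keep 2), then a narrow pass (keep 1).
result = cascade("um i think there was a spinning top and maybe dream stuff",
                 corpus, toy_rewrite, [(overlap, 2), (overlap, 1)])
```

In a real pipeline each stage would use a progressively more expensive scorer over a progressively smaller candidate list, which is how a cascade trades effectiveness against compute.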
