Beyond Questions: Leveraging ColBERT for Keyphrase Search

Jorge Gabín, Javier Parapar, Craig Macdonald

TL;DR

Dense retrieval models are trained predominantly on question-like queries and therefore struggle with keyphrase search, which is prevalent in academic and professional contexts. This work introduces two ColBERT-based rankers, ColBERTKP_QD and ColBERTKP_Q, and an auxiliary monoT5KP model, all trained on keyphrase-form queries that LLMs generate from question-like queries to create keyphrase-focused training data. ColBERTKP_QD trains the full model, whereas ColBERTKP_Q trains only the query encoder and keeps the document encoder weights static to reduce training cost. Evaluated on both automatically generated and manually annotated keyphrases, the keyphrase-tailored models outperform baselines on keyphrase search while remaining competitive on the original question-like queries, and the approach generalises to other architectures and to traditional title-based queries. The findings point to stronger semantic and special-token matching and reduced reliance on lexical matching, with practical implications for specialised search and for robust performance in mixed-query environments. Future work outlined by the authors includes adopting ColBERTv2, model distillation, and a query-type classifier that dynamically selects the appropriate encoder for a given query.

Abstract

While question-like queries are gaining popularity and search engines' users increasingly adopt them, keyphrase search has traditionally been the cornerstone of web search. This query type is also prevalent in specialised search tasks such as academic or professional search, where experts rely on keyphrases to articulate their information needs. However, current dense retrieval models often fail with keyphrase-like queries, primarily because they are mostly trained on question-like ones. This paper introduces a novel model that employs the ColBERT architecture to enhance document ranking for keyphrase queries. For that, given the lack of large keyphrase-based retrieval datasets, we first explore how Large Language Models can convert question-like queries into keyphrase format. Then, using those keyphrases, we train a keyphrase-based ColBERT ranker (ColBERTKP_QD) to improve the performance when working with keyphrase queries. Furthermore, to reduce the training costs associated with training the full ColBERT model, we investigate the feasibility of training only a keyphrase query encoder while keeping the document encoder weights static (ColBERTKP_Q). We assess our proposals' ranking performance using both automatically generated and manually annotated keyphrases. Our results reveal the potential of the late interaction architecture when working under the keyphrase search scenario.

Paper Structure

This paper contains 28 sections, 7 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Training process of the ColBERTKP$_{Q}$ model using the transformed MS MARCO training triples.
  • Figure 2: ColBERT and ColBERTKP$_{Q}$ interaction for query 962179 (in both original and keyphrase format) and passage 2329699 (shortened). Darker shading in the interaction matrix denotes higher similarity, while the × symbol highlights the document embedding (row) with the highest similarity for each query embedding (column). The histogram above illustrates each query embedding's contribution to the document's final score, with shading indicating the magnitude of the contribution.
  • Figure 3: Performance across different types of queries according to the query type taxonomy presented by Bolotova et al. (2022). The "Original" and "Mistral keyphrases" bars show the delta between using ColBERTKP$_{Q}$ and ColBERT as the retrieval model. The "ColBERTKP$_{Q}$" and "ColBERT" bars show the delta between using keyphrases and original queries.
  • Figure 4: Representation of each matching type.
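
The interaction matrix described for Figure 2 corresponds to ColBERT-style late interaction: each query embedding is matched to its most similar document embedding, and the per-query-embedding maxima are summed into the document score. The following is a minimal sketch of that scoring step, assuming cosine similarity over toy 2-dimensional embeddings. The function names (`normalize`, `dot`, `maxsim_score`) and the example vectors are hypothetical and chosen for illustration; this is not the authors' code and it omits the encoders entirely.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: sum over query embeddings of the maximum
    cosine similarity to any document embedding.

    Returns (score, contributions, best_doc_idx), where contributions[i] is
    the share of the score coming from query embedding i (the histogram in
    a Figure 2 style plot) and best_doc_idx[i] is the index of the document
    embedding it matched (the x marks in the matrix).
    """
    q = [normalize(v) for v in query_embs]
    d = [normalize(v) for v in doc_embs]
    contributions = []
    best_doc_idx = []
    for qv in q:
        sims = [dot(qv, dv) for dv in d]  # one column of the interaction matrix
        best = max(range(len(sims)), key=sims.__getitem__)
        best_doc_idx.append(best)
        contributions.append(sims[best])
    return sum(contributions), contributions, best_doc_idx


# Toy example (made-up embeddings, not taken from the paper).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
score, contribs, argmax = maxsim_score(query, doc)
print(score, contribs, argmax)
```

In this toy case the first query embedding matches document embedding 0 and the second matches document embedding 2, so both contribute a similarity of 1.0. With a real model the embeddings would come from the query and document encoders, and only the query side would change between ColBERT and ColBERTKP$_{Q}$ under the frozen-document-encoder setup the abstract describes.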