HyQE: Ranking Contexts with Hypothetical Query Embeddings

Weichao Zhou, Jiaxin Zhang, Hilaf Hasson, Anu Singh, Wenchao Li

TL;DR

This work introduces a scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning and is compatible with many other retrieval and ranking techniques.

Abstract

In retrieval-augmented systems, context ranking techniques are commonly employed to reorder the retrieved contexts based on their relevance to a user query. A standard approach is to measure this relevance through the similarity between contexts and queries in the embedding space. However, such similarity often fails to capture relevance. Alternatively, large language models (LLMs) have been used for ranking contexts. However, they can encounter scalability issues when the number of candidate contexts grows while the context window sizes of the LLMs remain constrained. Additionally, these approaches require fine-tuning LLMs with domain-specific data. In this work, we introduce a scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning. Our framework uses a pre-trained LLM to hypothesize the user query based on the retrieved contexts and ranks the contexts based on the similarity between the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other retrieval and ranking techniques. Experimental results show that our method improves the ranking performance across multiple benchmarks. The complete code and data are available at https://github.com/zwc662/hyqe

Paper Structure

This paper contains 12 sections, 6 equations, 14 figures, 9 tables, 1 algorithm.

Figures (14)

  • Figure 1: A flow chart of the HyQE ranking framework. Given a query $q$ and a retrieved context $c$, an LLM $H$ generates a set of hypothetical queries $\hat{q}$ from $c$. An embedding model $E$ is then used to evaluate the semantic similarity between $q$ and each $\hat{q}$, and cosine similarity determines whether $c$ is relevant to $q$, as in Eq. \ref{eq:hyqe}.
  • Figure 2: Prompt for hypothetical query generation. '{context}' is the placeholder for the context to be filled.
  • Figure 3: The random variables $c$ and $q$ respectively indicate the context and the user input query. (a) Cosine similarity prioritizes semantic similarity rather than retrieving a better context for answering the query. (b) The causal relationship in query expansion methods such as HyDE. The random variable $\hat{c}$ is a hypothetical context, and $D$ indicates the prior knowledge of the LLM used to generate $\hat{c}$. In this example, we use GPT-3.5-turbo to generate a hypothetical context $\hat{c}$ to answer the question in $q$. However, $\hat{c}$ contains outdated information and cannot be used to retrieve the most relevant context $c$ through semantic search. (c) The causal relationship in HyQE. An LLM $H$ is used to generate the hypothetical query $\hat{q}$. The causal relationship between $q$ and $\hat{q}$ can be simulated with causal similarity.
  • Figure 4: ICA on the bge-base-en-v1.5 embeddings for $4$ queries from the COVID dataset. Each figure corresponds to one query. The large purple circle represents a query in the dataset. The red squares represent the top 5 contexts ranked using cosine similarity, and the red triangles represent the corresponding hypothetical queries. The green squares represent the top 5 contexts ranked using our method, and the green triangles represent the corresponding hypothetical queries. The full analysis of all $50$ queries can be found in Appendix \ref{sec:app_0}.
  • Figure 5: Statistics of the number of hypothetical queries generated for the contexts in the COVID dataset. The x-axis indicates the number of hypothetical queries generated for a context. The y-axis indicates the percentage of contexts in the dataset. The full results on all the datasets can be found in Appendix \ref{sec:app_2}.
  • ...and 9 more figures
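
The flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_embed` is a bag-of-words stand-in for the embedding model $E$, `canned_generate` is a lookup-table stand-in for the LLM $H$, the two contexts are made-up toy examples, and scoring a context by the maximum cosine similarity over its hypothetical queries is an assumption, since the paper's scoring equation is not reproduced in this section.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for the embedding model E: word-count vector."""
    tokens = [t.strip(".,?!").lower() for t in text.split()]
    return {tok: float(n) for tok, n in Counter(t for t in tokens if t).items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hyqe_score(query: str, hypothetical_queries: List[str],
               embed: Callable[[str], Vector]) -> float:
    """Score a context via its hypothetical queries.

    Assumption: aggregate with max over the hypothetical queries.
    """
    q_vec = embed(query)
    return max((cosine(q_vec, embed(h)) for h in hypothetical_queries),
               default=0.0)


def rank_contexts(query: str, contexts: List[str],
                  generate: Callable[[str], List[str]],
                  embed: Callable[[str], Vector] = toy_embed
                  ) -> List[Tuple[str, float]]:
    """Return (context, score) pairs sorted from most to least relevant."""
    scored = [(c, hyqe_score(query, generate(c), embed)) for c in contexts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: in a real system an LLM would produce these hypothetical queries.
CANNED = {
    "Masks reduce airborne transmission of the virus.":
        ["Do masks reduce transmission of the virus?"],
    "The vaccine trial enrolled 30,000 adults.":
        ["How many adults enrolled in the vaccine trial?"],
}


def canned_generate(context: str) -> List[str]:
    """Hypothetical stand-in for the LLM H: look up pre-written queries."""
    return CANNED.get(context, [])


ranking = rank_contexts("Do masks reduce virus transmission?",
                        list(CANNED), canned_generate)
for context, score in ranking:
    print(f"{score:.3f}  {context}")
```

Because the hypothetical queries depend only on the contexts, they could in principle be generated and embedded once ahead of time, leaving only the user-query embedding and the similarity computation for inference. This is consistent with the abstract's claim of inference-time efficiency, though the exact caching strategy here is an assumption.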