Rank It, Then Ask It: Input Reranking for Maximizing the Performance of LLMs on Symmetric Tasks

Mohsen Dehghankar, Abolfazl Asudeh

TL;DR

This work tackles improving LLM performance on symmetric tasks by reordering a bag of input elements before querying the model. It introduces the LLM input reranking problem and a two‑stage solution: (i) exposure discovery to learn rank‑position recall patterns for a given LLM, and (ii) relevance estimation to rank elements by their expected impact on the query answer, using a bipartite graph framework to debias scores. The expected utility is defined as $\mathbb{E}[utility(\pi|q)] = \sum_{i=1}^{|\mathcal{I}|} \mathbb{E}[\mathcal{X}_{\mathcal{L}}(i)] \cdot \mathbb{E}[Rel_q(e_{\pi(i)})]$, guiding the reranking. Experimental results on Graph Degree tasks and real DB queries show reranking can achieve up to 99% proximity to the optimum bound, with notable differences in memory patterns between GPT‑3.5 Turbo and GPT‑4o Mini. The approach remains model‑agnostic and acts as a wrapper to enhance symmetric‑task performance for current and future LLMs.
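The expected-utility sum above can be illustrated with a small sketch. It assumes the per-position exposure estimates and per-element relevance estimates are already available as plain lists; the function names `expected_utility` and `rerank` are hypothetical and not taken from the paper. Because the objective is a sum of products, the rearrangement inequality implies it is maximized by pairing the most relevant element with the most exposed position, the second most relevant with the second most exposed, and so on.

```python
from typing import List, Sequence


def expected_utility(exposure: Sequence[float],
                     relevance: Sequence[float],
                     order: Sequence[int]) -> float:
    """Sum over positions i of exposure[i] * relevance[order[i]].

    `order[i]` is the index of the element placed at position i.
    """
    return sum(exposure[pos] * relevance[elem] for pos, elem in enumerate(order))


def rerank(exposure: Sequence[float], relevance: Sequence[float]) -> List[int]:
    """Return an ordering that maximizes the expected utility.

    Assumption: both inputs are estimates of equal length. Positions are
    sorted by exposure, elements by relevance, and the two sorted lists
    are matched pairwise (largest with largest).
    """
    n = len(exposure)
    if len(relevance) != n:
        raise ValueError("exposure and relevance must have the same length")

    positions = sorted(range(n), key=lambda i: exposure[i], reverse=True)
    elements = sorted(range(n), key=lambda e: relevance[e], reverse=True)

    order = [0] * n
    for pos, elem in zip(positions, elements):
        order[pos] = elem
    return order


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    exposure = [0.9, 0.2, 0.5]   # position 0 is recalled best, position 1 worst
    relevance = [0.1, 0.8, 0.4]  # element 1 matters most for the query
    order = rerank(exposure, relevance)
    print(order, expected_utility(exposure, relevance, order))
```

In this toy input the most relevant element (index 1) lands at position 0, the position with the highest exposure, and the least relevant element lands at the least exposed position.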

Abstract

Large language models (LLMs) have quickly emerged as practical and versatile tools that provide new solutions for a wide range of domains. In this paper, we consider the application of LLMs on symmetric tasks where a query is asked on an (unordered) bag of elements. Examples of such tasks include answering aggregate queries on a database table. In general, when the bag contains a large number of elements, LLMs tend to overlook some elements, leading to challenges in generating accurate responses to the query. LLMs receive their inputs as ordered sequences. However, in this problem, we leverage the fact that the symmetric input is not ordered, and reordering should not affect the LLM's response. Observing that LLMs are less likely to miss elements at certain positions of the input, we introduce the problem of LLM input reranking: to find a ranking of the input that maximizes the LLM's accuracy for the given query without making explicit assumptions about the query. Finding the optimal ranking requires identifying (i) the relevance of each input element for answering the query and (ii) the importance of each rank position for the LLM's attention. We develop algorithms for estimating these values efficiently utilizing a helper LLM. We conduct comprehensive experiments on different synthetic and real datasets to validate our proposal and to evaluate the effectiveness of our proposed algorithms. Our experiments confirm that our reranking approach improves the accuracy of the LLMs on symmetric tasks by up to $99\%$ proximity to the optimum upper bound.

Paper Structure

This paper contains 32 sections, 1 theorem, 16 equations, 4 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

The process described in Equation (eq:update-s-beta) eventually converges.

Figures (4)

  • Figure 1: Illustrating the average error of GPT-3.5 Turbo on Graph Degree Task based on different input sizes. The error is the absolute difference between the real degree of a node (less than 20 in this case) and the reported degree by the LLM.
  • Figure 2: An example of the bipartite representation of evaluations. Each node in $U$ represents an element in $\mathcal{I}$ and its final score. Each node in $V$ represents one evaluation done on a partition $P_{i, k}$ from a shuffled list $E_i$ and the associated bias with that. The weights on the edges are the scores assigned by the helper LLM $\mathcal{H}$ to elements.
  • Figure 3: Exposure values for 'GPT-3.5 Turbo' and 'GPT-4o Mini.' The red plot represents the normalized error observed when placing relevant data at specific positions within a prompt of length 1000, averaged over 100 runs. In our model, the inverse of the average error at each position is proportional to the exposure $\mathcal{X}_{\mathcal{L}}(i)$. Higher error at a given location indicates lower exposure at that index.
  • Figure 4: Verifying the effect of the exposure function $\mathcal{X}_{\mathcal{L}}$ on the sorted list for GPT-4o Mini. For this LLM, sorting $\mathcal{I}$ in descending order results in the highest error rate. However, applying the exposure function significantly reduces the error. The helper LLM for this result is DeepSeek-Coder-v2.
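The Figure 3 caption states that exposure is proportional to the inverse of the average error at each position. A minimal sketch of that conversion follows, assuming the per-position average errors have already been measured. The function name `exposure_from_errors`, the small `eps` guard against division by zero, and the normalization to a sum of 1 are assumptions made for illustration; the caption only specifies proportionality.

```python
from typing import List, Sequence


def exposure_from_errors(avg_errors: Sequence[float], eps: float = 1e-9) -> List[float]:
    """Turn per-position average errors into exposure scores.

    Each score is proportional to 1 / error, so positions where the LLM
    makes larger errors receive lower exposure. `eps` avoids dividing by
    zero; scaling the scores to sum to 1 is an illustrative choice.
    """
    if not avg_errors:
        raise ValueError("avg_errors must not be empty")
    if any(e < 0 for e in avg_errors):
        raise ValueError("errors must be non-negative")

    inverse = [1.0 / (e + eps) for e in avg_errors]
    total = sum(inverse)
    return [v / total for v in inverse]


if __name__ == "__main__":
    # Illustrative numbers only: the middle position has the lowest error.
    errors = [0.5, 0.25, 1.0]
    print(exposure_from_errors(errors))
```

Any positive rescaling of these scores leaves the induced ordering of positions unchanged, so the normalization affects only readability, not which positions are preferred.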

Theorems & Definitions (3)

  • Definition 1: Bipartite Evaluation Graph
  • Theorem 1
  • Definition 2: Proximity