Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers

Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao

TL;DR

The paper tackles the challenge of evaluating LLM-based rerankers in information retrieval by introducing FLOPs-based, hardware-agnostic metrics. It defines two metrics, ranking metrics per PetaFLOP (RPP) and queries per PetaFLOP (QPP), and provides an interpretable closed-form FLOPs estimator for both decoder-only and encoder–decoder LLMs, enabling fair comparisons across models and runtimes. Empirical results show FLOPs-normalized metrics reveal efficiency–effectiveness trade-offs more reliably than traditional proxies, with a linear correlation between estimated and measured FLOPs and strong alignment with latency trends. The work delivers an open-source FLOPs calculator and demonstrates the practical relevance of hardware-agnostic cost measures for designing scalable, efficient reranking pipelines.

Abstract

Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and runtime choices (e.g., whether inference is parallelized, batch size) and often fail to account for model size, which makes them difficult to interpret and obscures the efficiency-effectiveness trade-off. To address this issue, we propose Efficiency-Effectiveness Reranking FLOPs metrics (https://github.com/zhiyuanpeng/EER-FLOPs) for LLM-based rerankers: RPP (ranking metrics per PetaFLOP), which measures how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP), which measures how many queries can be processed per PetaFLOP. Alongside the new metrics, we develop an interpretable FLOPs estimator that estimates the FLOPs of an LLM-based reranker without running any experiments. Using the proposed metrics, we conduct comprehensive experiments on a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
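
To make the two metrics concrete, the following is a minimal sketch of how RPP and QPP could be computed from per-query measurements. It is not taken from the authors' repository. The function names `rpp` and `qpp`, the choice to average the per-query ratio for RPP, and the toy inputs are all illustrative assumptions; the paper's exact aggregation may differ.

```python
PETA = 1e15  # number of FLOPs in one PetaFLOP


def rpp(metric_scores, flops_per_query):
    """Ranking metric per PetaFLOP, averaged over queries.

    Assumption: each query's metric (e.g., NDCG@10) is divided by that
    query's cost in PetaFLOPs, and the ratios are then averaged.
    """
    if len(metric_scores) != len(flops_per_query) or not metric_scores:
        raise ValueError("need one metric score and one FLOP count per query")
    ratios = [m / (c / PETA) for m, c in zip(metric_scores, flops_per_query)]
    return sum(ratios) / len(ratios)


def qpp(flops_per_query):
    """Queries per PetaFLOP: inverse of the average per-query cost in PetaFLOPs."""
    if not flops_per_query:
        raise ValueError("need at least one FLOP count")
    avg_petaflops = sum(flops_per_query) / len(flops_per_query) / PETA
    return 1.0 / avg_petaflops


if __name__ == "__main__":
    # Hypothetical toy values, not results from the paper.
    ndcg = [0.70, 0.80]    # per-query ranking quality
    flops = [2e14, 4e14]   # per-query reranking cost in FLOPs
    print(f"RPP: {rpp(ndcg, flops):.3f}")
    print(f"QPP: {qpp(flops):.3f}")
```

Because both quantities are normalized by FLOPs rather than wall-clock time, the same inputs give the same values regardless of the GPU or batch size used to run the reranker.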

Paper Structure

This paper contains 22 sections, 17 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Linear trends between estimated and measured FLOPs for decoder (left) and encoder-decoder (right) models of various sizes on DL19. The same is observed for the DL20 dataset.
  • Figure 2: Latency in milliseconds increases with FLOPs on Qwen-7B (left) and Flan-T5-XXL (right). The Pearson correlation coefficients between latency and estimated FLOP counts are 0.88 for Qwen-7B and 0.94 for Flan-T5-XXL.
  • Figure 3: FLOPs increase with prompt length for Qwen-7B (left) and Flan-T5-XL (right) on the DL19 dataset.