Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
Zhiyuan Peng, Ting-ruen Wei, Tingyu Song, Yilun Zhao
TL;DR
The paper tackles the challenge of evaluating LLM-based rerankers in information retrieval by introducing FLOPs-based, hardware-agnostic metrics. It defines two metrics, ranking metrics per PetaFLOP (RPP) and queries per PetaFLOP (QPP), and provides an interpretable closed-form FLOPs estimator for both decoder-only and encoder–decoder LLMs, enabling fair comparisons across models and runtimes. Empirical results show FLOPs-normalized metrics reveal efficiency–effectiveness trade-offs more reliably than traditional proxies, with a linear correlation between estimated and measured FLOPs and strong alignment with latency trends. The work delivers an open-source FLOPs calculator and demonstrates the practical relevance of hardware-agnostic cost measures for designing scalable, efficient reranking pipelines.
Abstract
Large Language Models (LLMs) have recently been applied to reranking tasks in information retrieval, achieving strong performance. However, their high computational demands often hinder practical deployment. Existing studies evaluate the efficiency of LLM-based rerankers using proxy metrics such as latency, the number of forward passes, input tokens, and output tokens. These metrics depend on hardware and runtime choices (e.g., whether inference is parallelized, batch size) and often fail to account for model size, which makes them difficult to interpret and obscures the efficiency-effectiveness trade-off. To address this issue, we propose Efficiency-Effectiveness Reranking FLOPs (https://github.com/zhiyuanpeng/EER-FLOPs) for LLM-based rerankers, comprising two metrics: RPP (ranking metrics per PetaFLOP), which measures how much ranking quality (e.g., NDCG or MRR) a method achieves per PetaFLOP, and QPP (queries per PetaFLOP), which measures how many queries can be processed per PetaFLOP. Alongside the new metrics, we develop an interpretable FLOPs estimator that estimates the FLOPs of an LLM-based reranker without running any experiments. Using the proposed metrics, we conduct comprehensive experiments to evaluate a wide range of LLM-based rerankers with different architectures, studying the efficiency-effectiveness trade-off and bringing this issue to the attention of the research community.
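To make the two metrics concrete, the following is a minimal sketch of how RPP and QPP could be computed from per-query FLOPs counts. The function names `rpp` and `qpp`, the input layout, and the choice to aggregate as a ratio of means are illustrative assumptions, not the paper's reference implementation; the exact aggregation used by the authors may differ.

```python
PETA = 1e15  # number of FLOPs in one PetaFLOP


def mean_petaflops(flops_per_query):
    """Average cost of one query, expressed in PetaFLOPs."""
    if not flops_per_query:
        raise ValueError("flops_per_query must not be empty")
    return sum(flops_per_query) / len(flops_per_query) / PETA


def qpp(flops_per_query):
    """Queries per PetaFLOP: how many average-cost queries fit in one PetaFLOP."""
    return 1.0 / mean_petaflops(flops_per_query)


def rpp(metric_scores, flops_per_query):
    """Ranking metric per PetaFLOP.

    Assumption: mean ranking metric (e.g., NDCG or MRR) divided by the
    mean PetaFLOPs spent per query.
    """
    if len(metric_scores) != len(flops_per_query):
        raise ValueError("need one metric score and one FLOPs count per query")
    mean_metric = sum(metric_scores) / len(metric_scores)
    return mean_metric / mean_petaflops(flops_per_query)


if __name__ == "__main__":
    # Hypothetical numbers for two queries, purely for illustration.
    ndcg = [0.6, 0.8]
    flops = [2e14, 3e14]
    print("QPP:", qpp(flops))
    print("RPP:", rpp(ndcg, flops))
```

Because both quantities are normalized by FLOPs rather than by wall-clock time, the same inputs give the same values regardless of the hardware or batch size used to run the reranker.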
