RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers

Zhichao Xu

TL;DR

RankMamba investigates whether Mamba, a selective state space model, can match transformer-based language models for document ranking under a standard reranking setup. The study notes that attention costs $O(n^2)$ time during training and $O(n)$ time during inference, and evaluates Mamba alongside encoder-only, decoder-only, and encoder-decoder transformer backbones on MSMARCO and DL benchmarks. It finds that Mamba achieves competitive ranking performance at similar model scales, but that current implementations have lower training throughput than Flash Attention-enabled transformers, though LoRA helps larger Mamba variants. Code is released for reproducibility and to seed further exploration of Mamba in classical IR tasks.

Abstract

The Transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV) and information retrieval (IR). Its core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure, Mamba, which is based on state space models, has achieved transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input, and predicts a scalar relevance score. This task demands the language model's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models with the same training recipe; (2) but they also have a lower training throughput in comparison to efficient transformer implementations such as Flash Attention. We hope this study can serve as a starting point to explore Mamba models in other classical IR tasks. Our code implementation (https://github.com/zhichaoxu-shufe/RankMamba) is made public to facilitate reproducibility. Refer to Xu et al. (2025) for more comprehensive experiments and results, including passage ranking.
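
To make the reranking setup concrete, the sketch below shows only the interface described in the abstract: a scoring function receives a (query, document) pair and returns a scalar, and candidates are sorted by that scalar. The `toy_score` function is a hypothetical token-overlap stand-in, not the Mamba or transformer rerankers studied in the paper; the names `toy_score` and `rerank` are assumptions introduced for illustration.

```python
from typing import Callable, List, Tuple


def toy_score(query: str, document: str) -> float:
    """Hypothetical stand-in scorer: fraction of query tokens found in the document.

    A real reranker would feed the query and document through a language
    model and read off a scalar relevance score instead.
    """
    query_tokens = query.lower().split()
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return sum(tok in doc_tokens for tok in query_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float] = toy_score,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "state space models for long sequences",
        "a recipe for sourdough bread",
        "mamba is a selective state space model",
    ]
    for doc, score in rerank("selective state space model", docs):
        print(f"{score:.2f}  {doc}")
```

Passing the scorer as `score_fn` keeps the ranking loop independent of the backbone, which mirrors the idea of comparing different model families under the same reranking recipe.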

Paper Structure

This paper contains 17 sections, 8 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Training throughput of models with $\approx$330M parameters and models with $>$700M parameters. Models with $>$700M parameters are trained with LoRA (rank 32). Mamba models have lower throughput and higher GPU memory consumption than efficient transformer implementations such as Flash Attention.
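
The caption mentions LoRA with rank 32. As a rough illustration of why this shrinks the number of trainable parameters, the sketch below counts what LoRA adds to a single weight matrix, assuming the standard formulation in which a frozen `d_out x d_in` weight receives a low-rank update `B @ A` with `B` of shape `d_out x rank` and `A` of shape `rank x d_in`. The 1024 x 1024 layer size and the function names are hypothetical and are not taken from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


if __name__ == "__main__":
    d_in = d_out = 1024  # hypothetical layer size, for illustration only
    rank = 32            # rank stated in the figure caption
    lora = lora_trainable_params(d_in, d_out, rank)
    full = full_params(d_in, d_out)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```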