Scaling Sparse and Dense Retrieval in Decoder-Only LLMs

Hansi Zeng, Julian Killingback, Hamed Zamani

TL;DR

This paper investigates how sparse and dense retrieval paradigms scale in decoder-only LLMs (Llama-3 1B, 3B, and 8B) under a fixed compute budget, using MS MARCO passages for training, MS MARCO and TREC DL for in-domain evaluation, and BEIR for out-of-domain evaluation. It uses masked next token prediction (MNTP) pretraining with bidirectional attention to enable sparse projections in decoder-only models, and compares three fine-tuning objectives: contrastive loss (CL), knowledge distillation (KD), and their combination (CL+KD). The findings show that clear gains from model size emerge only with CL, that sparse retrieval outperforms dense retrieval on both in-domain and BEIR benchmarks while being more robust to imperfect supervision signals, and that sparse retrieval trained with CL+KD at 8B achieves state-of-the-art results on all evaluation sets. The work also suggests that KD benefits smaller models more, and it offers practical guidance for building high-performance, generalizable retrieval systems with large decoder-only LLMs.
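To make the sparse versus dense distinction concrete, here is a minimal sketch of how each paradigm typically scores a query against a passage. It is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and the sparse pooling (max over tokens of `log(1 + relu(logit))`) is a SPLADE-style choice assumed here because this summary does not state the exact formulation.

```python
import math


def dense_score(query_vec, doc_vec):
    """Dense retrieval: dot product of two fixed-size embeddings."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


def sparse_vector(token_logits):
    """Collapse per-token vocabulary logits into a sparse term-weight map.

    token_logits: one list of vocabulary-sized logits per input token.
    Assumption: SPLADE-style pooling, max over tokens of log(1 + relu(logit)).
    Only terms with a positive weight are stored, which keeps the vector sparse.
    """
    vocab_size = len(token_logits[0])
    weights = {}
    for term in range(vocab_size):
        w = max(math.log1p(max(row[term], 0.0)) for row in token_logits)
        if w > 0.0:
            weights[term] = w
    return weights


def sparse_score(query_weights, doc_weights):
    """Sparse retrieval: dot product over the terms both vectors share."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


if __name__ == "__main__":
    # Toy example with a 3-term vocabulary (values are made up).
    query = sparse_vector([[1.0, -2.0, 0.0], [0.5, -1.0, 0.0]])
    passage = sparse_vector([[2.0, 3.0, -1.0]])
    print(query, passage, sparse_score(query, passage))
```

Because the sparse vector lives in vocabulary space, it can be served from an inverted index, whereas the dense embedding requires nearest-neighbor search over vectors.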

Abstract

Scaling large language models (LLMs) has shown great potential for improving retrieval model performance; however, previous studies have mainly focused on dense retrieval trained with contrastive loss (CL), neglecting the scaling behavior of other retrieval paradigms and optimization techniques, such as sparse retrieval and knowledge distillation (KD). In this work, we conduct a systematic comparative study on how different retrieval paradigms (sparse vs. dense) and fine-tuning objectives (CL vs. KD vs. their combination) affect retrieval performance across different model scales. Using MSMARCO passages as the training dataset, decoder-only LLMs (Llama-3 series: 1B, 3B, 8B), and a fixed compute budget, we evaluate various training configurations on both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks. Our key findings reveal that: (1) Scaling behaviors emerge clearly only with CL, where larger models achieve significant performance gains, whereas KD-trained models show minimal improvement, performing similarly across the 1B, 3B, and 8B scales. (2) Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks, and they demonstrate greater robustness to imperfect supervised signals. (3) We successfully scale sparse retrieval models with the combination of CL and KD losses at 8B scale, achieving state-of-the-art (SOTA) results in all evaluation sets.
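The three fine-tuning objectives compared in the abstract can be sketched as follows. This is a hedged illustration using common textbook forms, since the paper's equations are not reproduced in this summary: CL is written as an InfoNCE-style loss over one positive and several negatives, KD as the KL divergence from a teacher's score distribution to the student's, and the combination as a weighted sum. The function names and the `kd_weight` parameter are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def contrastive_loss(scores, positive_index=0):
    """CL (InfoNCE form): negative log-probability of the positive passage."""
    return -log_softmax(scores)[positive_index]


def kd_loss(student_scores, teacher_scores):
    """KD (assumed form): KL(teacher || student) over the candidate passages."""
    log_t = log_softmax(teacher_scores)
    log_s = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def combined_loss(student_scores, teacher_scores, positive_index=0, kd_weight=1.0):
    """CL + KD: weighted sum; kd_weight is a hypothetical hyperparameter."""
    return (contrastive_loss(student_scores, positive_index)
            + kd_weight * kd_loss(student_scores, teacher_scores))


if __name__ == "__main__":
    # Retriever scores for [positive, negative, negative] and made-up teacher scores.
    student = [2.0, 1.0, 0.5]
    teacher = [3.0, 0.5, 0.2]
    print(contrastive_loss(student), kd_loss(student, teacher),
          combined_loss(student, teacher))
```

CL relies only on the relevance labels, while KD relies on the teacher's scores, so the combined loss exposes the student to both supervision signals.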

Paper Structure

This paper contains 18 sections, 6 equations, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: Dense and sparse retrieval results on the combination of TREC DL 19 and 20 (in-domain) and on the BEIR datasets (out-of-domain).