Scaling Laws For Dense Retrieval

Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, Yiqun Liu

TL;DR

This paper demonstrates that dense retrieval models exhibit clear power-law scaling with respect to both model size and annotated data size when evaluated with a continuous metric, contrastive entropy. By systematically varying pre-trained backbones and annotation strategies on MSMARCO and T2Ranking, the authors establish model-size and data-size scaling laws and introduce a joint law to capture their interaction, enabling budget-aware resource planning. They further show how annotation quality modulates scaling and propose an application to optimize training under cost constraints, including the impact of inference costs. The work provides a practical framework for predicting DR performance, guiding data collection, model selection, and annotation strategies, while outlining limitations and directions for expanding scaling analyses to broader architectures and domains.

Abstract

Scaling up neural models has yielded significant advancements in a wide array of tasks, particularly in language generation. Previous studies have found that the performance of neural models frequently adheres to predictable scaling laws, correlated with factors such as training set size and model size. This insight is invaluable, especially as large-scale experiments grow increasingly resource-intensive. Yet, such scaling law has not been fully explored in dense retrieval due to the discrete nature of retrieval metrics and complex relationships between training data and model sizes in retrieval tasks. In this study, we investigate whether the performance of dense retrieval models follows the scaling law as other neural models. We propose to use contrastive log-likelihood as the evaluation metric and conduct extensive experiments with dense retrieval models implemented with different numbers of parameters and trained with different amounts of annotated data. Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations. Additionally, we examine scaling with prevalent data augmentation methods to assess the impact of annotation quality, and apply the scaling law to find the best resource allocation strategy under a budget constraint. We believe that these insights will significantly contribute to understanding the scaling effect of dense retrieval models and offer meaningful guidance for future research endeavors.
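
As an illustration of why a contrastive log-likelihood gives a continuous signal where rank-based metrics do not, the following minimal sketch computes a softmax-style negative log-likelihood of the annotated positive passage against a set of sampled negatives. The function names, the use of raw relevance scores as inputs, and the way negatives are supplied are assumptions made for this example; the paper's exact definition and sampling protocol are not reproduced in this summary.

```python
import math
from typing import Sequence, Tuple


def contrastive_entropy(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log-probability of the positive passage under a softmax
    over the positive score and the sampled negative scores.

    Assumption: scores are query-passage similarities (e.g. dot products).
    """
    scores = [pos_score, *neg_scores]
    shift = max(scores)  # subtract the max for numerical stability
    log_partition = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_partition - pos_score


def mean_contrastive_entropy(
    examples: Sequence[Tuple[float, Sequence[float]]]
) -> float:
    """Average the per-query contrastive entropy over an evaluation set."""
    values = [contrastive_entropy(pos, negs) for pos, negs in examples]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical scores for two queries, each with three sampled negatives.
    toy_eval = [(4.2, [1.0, 0.3, -0.5]), (2.1, [2.0, 1.5, 0.2])]
    print(mean_contrastive_entropy(toy_eval))
```

Because the value changes smoothly with every score, small improvements in the model remain visible even when the ranked position of the positive passage does not change.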

Paper Structure (21 sections, 10 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Performance of various models on MSMARCO Passage Ranking (left) and T2Ranking (right) datasets. It shows the number of non-embedding parameters (x-axis) and the test-set contrastive entropy (y-axis). The stars and points represent the actual performance. The curves are derived from the scaling law and match the observed data.
  • Figure 2: Relationship between standard ranking metrics and contrastive entropy for different dense retrieval models on the MSMARCO Passage Ranking dataset. The panels plot contrastive entropy (x-axis) against standard ranking metrics (y-axis). The results indicate a strong positive correlation. In addition, the panels highlight an emergent-ability phenomenon (Wei et al., 2022) around a contrastive entropy of approximately 0.25, where ranking metrics improve sharply.
  • Figure 3: Scaling laws for model performance as a function of model size on MSMARCO Passage Ranking (left) and T2Ranking (right) datasets. The figures display the contrastive entropy (y-axis) against the number of non-embedding parameters (x-axis, logarithmic scale) for different models. Points and stars represent the actual performance, aligning closely along a straight line. The dashed lines are fitted using the model-size scaling law equation, demonstrating a close match with the empirical data.
  • Figure 4: Scaling laws for model performance relative to training data size on MSMARCO Passage Ranking (left) and T2Ranking (right) datasets. The figures illustrate the contrastive entropy (y-axis) as a function of the number of annotated query-passage pairs (x-axis, logarithmic scale) for a fixed model size. Points and stars show the actual performance, aligning closely with a straight line. The dashed lines are fitted using the data-size scaling law equation, demonstrating a strong fit with the empirical data.
  • Figure 5: Scaling effects of annotation quality for retrieval performance on MS MARCO. Dashed lines are fitted using the data-size scaling law equation and demonstrate power-law scaling across different annotation methods. ChatGLM3 annotations exhibit the steepest slope and surpass human annotations at 500k.
  • ...and 3 more figures
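
The fitted curves in the figures above can be approximated with a small script. The sketch below assumes a power law with an irreducible term, L(x) = (A / x)^alpha + delta, where x is either the number of non-embedding parameters or the number of annotated pairs. This functional form, the grid-search fitting procedure, and all names are assumptions chosen for illustration; the paper's own equations and fitting method are not shown in this summary.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], losses: Sequence[float],
                  n_grid: int = 2000) -> Tuple[float, float, float]:
    """Fit L(x) = (A / x) ** alpha + delta and return (A, alpha, delta).

    For a fixed delta the model is linear in log space:
        log(L - delta) = alpha * log(A) - alpha * log(x)
    so delta is scanned on a grid and the rest is ordinary least squares.
    Assumption: all losses are strictly positive.
    """
    log_x = [math.log(x) for x in sizes]
    n = len(log_x)
    mean_x = sum(log_x) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    min_loss = min(losses)
    best = None  # (squared error in loss space, A, alpha, delta)
    for i in range(n_grid):
        delta = min_loss * i / n_grid  # candidate irreducible loss, < min_loss
        log_y = [math.log(loss - delta) for loss in losses]
        mean_y = sum(log_y) / n
        slope = sum((lx - mean_x) * (ly - mean_y)
                    for lx, ly in zip(log_x, log_y)) / sxx
        if slope >= 0:
            continue  # loss must decrease with size for a valid power law
        alpha = -slope
        intercept = mean_y - slope * mean_x
        sse = sum((math.exp(intercept + slope * lx) + delta - loss) ** 2
                  for lx, loss in zip(log_x, losses))
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept / alpha), alpha, delta)
    if best is None:
        raise ValueError("loss does not decrease with size; nothing to fit")
    return best[1], best[2], best[3]


def predict(size: float, A: float, alpha: float, delta: float) -> float:
    """Evaluate the fitted law at a new model or data size."""
    return (A / size) ** alpha + delta


if __name__ == "__main__":
    # Synthetic, made-up measurements (not values from the paper).
    sizes = [1e6, 1e7, 1e8, 1e9]
    losses = [(1e5 / s) ** 0.5 + 0.1 for s in sizes]
    A, alpha, delta = fit_power_law(sizes, losses)
    print(A, alpha, delta, predict(1e10, A, alpha, delta))
```

Scanning delta on a grid keeps the example dependency-free; a nonlinear least-squares routine would be a natural replacement when more measurements are available.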