Enhancing Dense Retrievers' Robustness with Group-level Reweighting

Peixuan Han, Zhenghao Liu, Zhiyuan Liu, Chenyan Xiong

TL;DR

WebDRO is an efficient approach that clusters web graph data and optimizes group weights to make dense retrieval models more robust. Further analysis confirms the stability and validity of the group weights it learns.

Abstract

Anchor-document data derived from web graphs offers a wealth of paired information for training dense retrieval models in an unsupervised manner. However, unsupervised data contains diverse patterns across the web graph and is often significantly imbalanced, leading to suboptimal performance on underrepresented or difficult groups. In this paper, we introduce WebDRO, an efficient approach for clustering web graph data and optimizing group weights to enhance the robustness of dense retrieval models. First, we build an embedding model for clustering anchor-document pairs. Specifically, we contrastively train the embedding model for link prediction, which guides it to capture the document features behind web graph links. Second, we employ group distributionally robust optimization to recalibrate the weights across different clusters of anchor-document pairs while training retrieval models. During training, we direct the model to assign higher weights to clusters with higher loss and to focus more on worst-case scenarios. This approach is designed to give the model strong generalization ability on all data patterns. Our experiments on MS MARCO and BEIR demonstrate that our method can effectively improve retrieval performance in both unsupervised training and finetuning settings. Further analysis confirms the stability and validity of the group weights learned by WebDRO. The code for this paper is available at https://github.com/Hanpx20/GroupDRO_Dense_Retrieval.
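
The reweighting idea in the abstract (higher weight for clusters with higher loss) can be illustrated with a minimal sketch. It assumes the exponentiated-gradient update commonly used in group DRO; the function name `update_group_weights`, the step size `eta`, and the example losses are hypothetical, and the exact rule used by WebDRO may differ.

```python
import math

def update_group_weights(weights, group_losses, eta=0.1):
    """Shift weight toward groups with higher loss, then renormalize.

    Assumed update (not taken from the paper): w_g <- w_g * exp(eta * loss_g).
    """
    scaled = [w * math.exp(eta * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    # Renormalize so the weights remain a probability distribution.
    return [s / total for s in scaled]

# Three hypothetical clusters, starting from uniform weights.
weights = [1.0 / 3.0] * 3
group_losses = [0.5, 2.0, 1.0]  # illustrative per-cluster mean losses
new_weights = update_group_weights(weights, group_losses)

# The training objective is the weighted sum of per-group losses.
robust_loss = sum(w * loss for w, loss in zip(new_weights, group_losses))
uniform_loss = sum(group_losses) / len(group_losses)
```

Because the weights grow with the loss, `robust_loss` is larger than `uniform_loss` here, so the hardest cluster contributes more to the gradient than it would under plain averaging.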

Paper Structure (19 sections, 6 equations, 4 figures, 4 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of key steps in WebDRO. Figure 1(a) shows the clustering process in Step A, where we use web links to train an embedding model and use the model to cluster documents. Figure 1(b) shows the training process of the retrieval model in Step B. We aim to align the anchor with the corresponding document while pushing apart unrelated ones. Each group is assigned a weight that is dynamically updated based on the training loss.
  • Figure 2: The relationship between performance gain of WebDRO and performance of Anchor-DR. Each point represents a dataset in BEIR. Datasets with high or negative gains are annotated.
  • Figure 3: The visualization of group weights during training. Group weights are recorded every 500 steps.
  • Figure 4: Loss landscapes of Anchor-DR and WebDRO. We constrain the perturbation in each direction to $(-0.3, 0.3)$. Each closed curve in the figure indicates that the corresponding positions have the same loss.
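
Step B in Figure 1 (aligning each anchor with its document, pushing apart unrelated ones, and weighting by group) can be sketched as a group-weighted contrastive loss. This is an illustration under stated assumptions, not the authors' implementation: it assumes in-batch negatives with a softmax cross-entropy loss and per-group averaging within the batch. The function names, similarity scores, group ids, and weights are all hypothetical.

```python
import math

def in_batch_contrastive_losses(scores):
    """Per-anchor loss, where scores[i][j] is the similarity of anchor i and
    document j, and document i is the positive for anchor i."""
    losses = []
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])
    return losses

def group_weighted_loss(losses, group_ids, group_weights):
    """Average the loss within each group present in the batch, then
    combine the group averages using the current group weights."""
    sums, counts = {}, {}
    for loss, g in zip(losses, group_ids):
        sums[g] = sums.get(g, 0.0) + loss
        counts[g] = counts.get(g, 0) + 1
    return sum(group_weights[g] * sums[g] / counts[g] for g in sums)

# Hypothetical batch of three anchor-document pairs from two clusters.
scores = [[5.0, 1.0, 0.0],
          [2.0, 2.5, 2.0],
          [0.0, 1.0, 4.0]]
group_ids = [0, 1, 0]
group_weights = {0: 0.4, 1: 0.6}

losses = in_batch_contrastive_losses(scores)
total = group_weighted_loss(losses, group_ids, group_weights)
```

The second pair has the least separated scores and therefore the largest per-anchor loss. Since its cluster also carries the larger weight, it dominates `total`.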