T2Ranking: A large-scale Chinese Benchmark for Passage Ranking

Xiaohui Xie; Qian Dong; Bingning Wang; Feiyang Lv; Ting Yao; Weinan Gan; Zhijing Wu; Xiangsheng Li; Haitao Li; Yiqun Liu; Jin Ma

TL;DR

T2Ranking addresses the shortage of large-scale Chinese passage-ranking benchmarks with fine-grained relevance by constructing a Chinese dataset from real search logs. It combines model-based passage segmentation, Ward clustering for deduplication, and active-learning data sampling to produce high-quality, diverse training data for retrieval and re-ranking. The paper provides extensive baselines using sparse and dense retrieval methods and a cross-encoder re-ranker, revealing that the dataset is highly challenging due to fine-grained labels and query diversity. By making the data and code publicly available, T2Ranking aims to drive progress in Chinese IR and more robust evaluation of passage-ranking models.
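
To make the deduplication step concrete, the following is a minimal sketch of Ward-style agglomerative clustering used to drop near-duplicate passages. It is not the authors' implementation: the toy 2-D vectors, the `max_cost` threshold, the rule of keeping the lowest-index passage per cluster, and the function names `ward_cost`, `ward_clusters`, and `deduplicate` are all assumptions made for illustration.

```python
from itertools import combinations

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    (ca, na, _), (cb, nb, _) = a, b
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist

def ward_clusters(vectors, max_cost):
    """Greedily merge the cheapest pair until the cheapest merge exceeds max_cost."""
    # Each cluster is (centroid, size, member indices).
    clusters = [(list(v), 1, [i]) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        if ward_cost(clusters[i], clusters[j]) > max_cost:
            break
        (ca, na, ma), (cb, nb, mb) = clusters[i], clusters[j]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merged = (centroid, na + nb, ma + mb)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for _, _, members in clusters]

def deduplicate(passages, vectors, max_cost=0.05):
    """Keep one passage per cluster (assumption: the one with the lowest index)."""
    keep = sorted(members[0] for members in ward_clusters(vectors, max_cost))
    return [passages[i] for i in keep]

# Hypothetical toy data: passages 0/1 and 2/3 are near-duplicates in vector space.
passages = ["p0", "p0 near-copy", "p2", "p2 near-copy", "p4"]
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [3.0, 0.0]]
unique_passages = deduplicate(passages, vectors)
print(unique_passages)
```

On this toy input the two near-duplicate pairs collapse into one passage each, leaving three passages. In practice the vectors would come from some text representation, and the threshold would need tuning; neither is specified in this summary.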

Abstract

Passage ranking involves two stages, passage retrieval and passage re-ranking, both of which are important and challenging topics for academia and industry in the area of Information Retrieval (IR). However, the commonly used datasets for passage ranking usually focus on the English language. For non-English scenarios, such as Chinese, the existing datasets are limited in data scale, lack fine-grained relevance annotation, and suffer from false negatives. To address this problem, we introduce T2Ranking, a large-scale Chinese benchmark for passage ranking. T2Ranking comprises more than 300K queries and over 2M unique passages from real-world search engines. Expert annotators are recruited to provide 4-level graded (fine-grained) relevance scores for query-passage pairs instead of binary (coarse-grained) relevance judgments. To mitigate the false negative issue, more passages with greater diversity are considered when performing relevance annotation, especially in the test set, to ensure a more accurate evaluation. Apart from the textual query and passage data, other auxiliary resources are also provided to facilitate further studies, such as query types and the XML files of the documents from which passages are generated. To evaluate the dataset, commonly used ranking models are implemented and tested on T2Ranking as baselines. The experimental results show that T2Ranking is challenging and that there is still scope for improvement. The full data and all code are available at https://github.com/THUIR/T2Ranking/
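
The difference between graded and binary relevance can be illustrated with a small nDCG computation. This is a sketch under stated assumptions, not the paper's evaluation code: the 0-3 label scale, the exponential gain `2**g - 1`, the binarization threshold of 2, and the two example rankings are all hypothetical choices made for illustration.

```python
import math

def dcg(labels, k):
    """Discounted cumulative gain over the top-k labels, given in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(labels[:k]))

def ndcg(labels, k):
    """DCG normalized by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True), k)
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

def binarize(labels, threshold=2):
    """Collapse graded labels to 0/1 (assumption: labels >= threshold count as relevant)."""
    return [1 if g >= threshold else 0 for g in labels]

# Two hypothetical rankings of the same four passages; each entry is the
# relevance label of the passage returned at that rank.
ranking_a = [3, 2, 0, 0]  # best passage first
ranking_b = [2, 3, 0, 0]  # best passage second

graded_a, graded_b = ndcg(ranking_a, 4), ndcg(ranking_b, 4)
binary_a, binary_b = ndcg(binarize(ranking_a), 4), ndcg(binarize(ranking_b), 4)
print(graded_a, graded_b, binary_a, binary_b)
```

With binary labels the two rankings receive the same score, while graded labels penalize the ranking that places the most relevant passage second. This is the kind of distinction that 4-level annotation makes measurable.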

Paper Structure (13 sections, 8 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Task Definition
Dataset Construction
Overall Pipeline
Model-based Passage Segmentation
Clustering-based Passage De-duplication
Active Learning-based Data Sampling
Data Statistics
Experiments and Results
Retrieval Performance
Re-ranking Performance
Conclusion

Figures (6)

Figure 1: Illustration of a well-written web document from Wikipedia with clearly defined paragraphs.
Figure 2: Illustration of the active learning framework.
Figure 3: Domain statistics for the training and test queries in $\rm T^2Ranking$.
Figure 4: Pie chart of the annotation distribution.
Figure 5: Illustration of the training process for the baselines used in our experiments. First, we train a dual-encoder with BM25 negatives, similar to DPR [karpukhin2020dense]. Second, we train the dual-encoder and cross-encoder with the global negative sampling strategy proposed in several studies [long2022multi; qiu2022dureader_retrieval].
...and 1 more figure
