NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain

Dawn Lawrie; James Mayfield; Eugene Yang; Andrew Yates; Sean MacAvaney; Ronak Pradeep; Scott Miller; Paul McNamee; Luca Soldaini

TL;DR

NeuCLIRTech addresses the need for robust CLIR benchmarks in the technical Chinese domain by creating a large, human-judged collection that supports both monolingual Chinese IR and CLIR with English queries. The dataset combines the 2023 and 2024 TREC NeuCLIR topics and provides deep relevance judgments, plus a fusion baseline so that rerankers can be evaluated on a first stage stronger than BM25. Experiments show that Qwen3-8B-based embeddings provide the strongest first-stage retrieval, yet cross-language performance remains challenging, with some rerankers failing to improve over the first stage. The dataset, released on Hugging Face Datasets, is a resource for evaluating first-stage and reranking methods on scientific abstracts and motivates further domain-adaptation research. The evaluation reports $nDCG@20$ and $Judged@20$ to quantify discriminatory power in this technical CLIR setting.
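The two reported metrics can be illustrated with a minimal sketch. It assumes graded relevance judgments stored as a dictionary from document ID to grade, linear gain, and unjudged documents counted as non-relevant; the function names and the toy data are hypothetical and are not taken from the NeuCLIRTech release or its evaluation scripts.

```python
import math


def judged_at_k(ranking, qrels, k=20):
    """Fraction of the top-k positions holding a document with any judgment.

    Assumption: the denominator is k even if the ranking is shorter than k.
    """
    return sum(1 for doc_id in ranking[:k] if doc_id in qrels) / k


def ndcg_at_k(ranking, qrels, k=20):
    """nDCG@k with linear gain; unjudged documents contribute zero gain."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranking[:k])
    )
    # Ideal ordering: judged grades sorted from highest to lowest.
    ideal_grades = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(
        grade / math.log2(position + 2)
        for position, grade in enumerate(ideal_grades)
    )
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: three judged documents, one unjudged document ("d9") retrieved.
toy_qrels = {"d1": 3, "d2": 0, "d3": 1}
toy_ranking = ["d3", "d9", "d1", "d2"]
toy_judged = judged_at_k(toy_ranking, toy_qrels, k=4)
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels, k=4)
```

A low Judged@20 signals that a system retrieves many documents the assessors never saw, in which case its nDCG@20 is likely an underestimate rather than evidence of poor ranking.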

Abstract

Measuring advances in retrieval requires test collections with relevance judgments that can faithfully distinguish systems. This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information. The collection consists of technical documents written natively in Chinese and those same documents machine translated into English. It includes 110 queries with relevance judgments. The collection supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language. NeuCLIRTech combines the TREC NeuCLIR track topics of 2023 and 2024. The 110 queries with 35,962 document judgments provide strong statistical power for distinguishing retrieval approaches. A fusion baseline of strong neural retrieval systems is included so that developers of reranking algorithms are not reliant on BM25 as their first-stage retriever. The dataset and artifacts are released on Hugging Face Datasets.
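The abstract does not say how the first-stage runs are fused. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common rank-based technique that is swapped in here as an assumption and may differ from the authors' method. The function name, the constant `k=60`, and the toy runs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(runs, k=60, depth=1000):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Assumption: each run is a list of document IDs ordered best first.
    k=60 is a conventional smoothing constant, not a value from the paper.
    """
    scores = defaultdict(float)
    for ranking in runs:
        for rank, doc_id in enumerate(ranking[:depth], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy example: three hypothetical first-stage runs for one query.
toy_runs = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(toy_runs)
```

A reranker would then rescore the top of the fused list, so it is measured against a stronger candidate pool than BM25 alone would supply.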

Paper Structure (27 sections, 6 figures, 1 table)

Introduction
Related Work
Dataset Creation
Documents
Queries
Relevance Judgments
Results
Conclusion
Model Descriptions
Sparse Retrieval
Multi-Dense Vector Retrieval
Learned Sparse Retrieval
Dense Retrieval
First-Stage Retrieval Fusion
Rerankers
...and 12 more sections

Figures (6)

Figure 1: Example document from the CSL dataset.
Figure 2: Screenshot of the interface that searches the collection. A query can be entered in English or Chinese. Clicking a ranked document's link on the left displays its contents in the middle panel. A document is judged by answering the questions in the right panel.
Figure 3: Screenshot of the interface where the assessor records information about the topic.
Figure 4: Screenshot of the interface where the assessor can modify the judgments of documents that have already been judged.
Figure 5: Screenshot of the final tab, where assessors complete the task by entering Chinese translations of the title and description. They also had to rate how easy it was to find central information and could optionally add a comment about the topic.
...and 1 more figure
