Towards Better Web Search Performance: Pre-training, Fine-tuning and Learning to Rank

Haitao Li, Jia Chen, Weihang Su, Qingyao Ai, Yiqun Liu

TL;DR

The paper aims to improve web search ranking through IR-tailored pre-training and diverse learning-to-rank (LTR) strategies. The authors pre-train a 12-layer transformer on a filtered Baidu web search corpus using masked language modeling (MLM) and click-through-rate (CTR) tasks, fine-tune it on expert-annotated data, and then extract statistical, axiomatic, and semantic features for LTR. By combining multiple LTR models with a curated set of 20 features, the approach performs strongly on the WSDM Cup 2023 Pre-training for Web Search task, reaching a leaderboard DCG@10 of about 10.04. The results highlight the value of data curation, targeted pre-training objectives, and feature-based fusion for practical web search ranking, and suggest future work on leveraging more search-log data for pre-training.
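For readers unfamiliar with the reported metric, the following is a minimal sketch of how DCG@10 can be computed for a single query. It assumes graded relevance labels listed in ranked order and an exponential gain of 2^rel - 1. That gain formulation, the function name `dcg_at_k`, and the `exponential_gain` flag are illustrative assumptions, not details confirmed by the summary above.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10,
             exponential_gain: bool = True) -> float:
    """Discounted cumulative gain over the top-k results of one query.

    `relevances` holds the graded relevance labels of the returned
    documents, in the order the ranker placed them.
    """
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        # Assumed gain: 2^rel - 1 (exponential) or the raw label (linear).
        gain = (2.0 ** rel - 1.0) if exponential_gain else float(rel)
        # Position discount: log2(rank + 1), so rank 1 is undiscounted.
        total += gain / math.log2(rank + 1)
    return total


if __name__ == "__main__":
    # Hypothetical labels for one query, best-first as ranked by a model.
    ranked_labels = [3, 2, 0, 1]
    print(dcg_at_k(ranked_labels, k=10))
```

A leaderboard figure such as the one quoted above would typically be the mean of this per-query value over all test queries, though the exact averaging used by the competition is not stated here.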

Abstract

This paper describes the approach of the THUIR team at the WSDM Cup 2023 Pre-training for Web Search task. The task requires participants to rank candidate documents by relevance for each query. We propose a new data pre-processing method and conduct pre-training and fine-tuning on the processed data. Moreover, we extract statistical, axiomatic, and semantic features to enhance ranking performance. After feature extraction, diverse learning-to-rank models are employed to merge these features. The experimental results show the effectiveness of our proposal. We ultimately achieved second place in the competition.

Paper Structure (15 sections, 5 equations, 2 tables)