Pairwise Judgment Formulation for Semantic Embedding Model in Web Search

Mengze Hong; Di Jiang; Zichang Guo; Chen Jason Zhang

TL;DR

This work tackles how to construct high-quality pairwise supervision for Semantic Embedding Models in web search by examining strategies to derive preferences from large-scale query logs. It challenges standard Learning-to-Rank heuristics, proposing atomic and hybrid pairwise formulations and validating them with extensive queries and click data from a major search engine. The study finds that strategies like Clicked$>$Non-Examined yield the strongest SEM performance, and that a hybrid approach Clicked$>$Non-Clicked can improve data coverage with marginal gains. The findings offer practical best practices for SEM training and point to future directions involving richer signals and advanced language models to further improve semantic relevance in search.

Abstract

Semantic Embedding Models (SEMs) have become a core component in information retrieval and natural language processing due to their ability to model semantic relevance. However, despite their growing use in search engines, few studies have systematically explored how to construct effective training data for SEMs from large-scale search engine query logs. In this paper, we present a comprehensive analysis of strategies for generating pairwise judgments as SEM training data. Interestingly, and perhaps surprisingly, we find that conventional formulation approaches used in Learning-to-Rank (LTR) are not necessarily optimal for SEM training. Through a large-scale empirical study using query logs and click-through data from a major search engine, we identify effective strategies and demonstrate the advantages of a proposed hybrid heuristic over simpler atomic heuristics. Finally, we provide best practices for SEM training and outline directions for future research.
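
To make the idea of pairwise judgments concrete, the two strategies named in the TL;DR (Clicked$>$Non-Examined and Clicked$>$Non-Clicked) can be sketched as below. This is only an illustration under stated assumptions, not the paper's implementation: it assumes a result counts as examined if it is ranked at or above the lowest-ranked click, a common click-model convention that this summary does not confirm. The names `split_impression`, `clicked_over_non_examined`, and `clicked_over_non_clicked` are hypothetical.

```python
from typing import List, Set, Tuple

# A pairwise judgment: (preferred_doc, less_preferred_doc).
Pair = Tuple[str, str]


def split_impression(
    ranked_docs: List[str], clicked: Set[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Split one ranked result list into clicked, non-clicked, non-examined.

    ASSUMPTION (not from the paper): everything ranked at or above the
    lowest-ranked click was examined; everything below it was not.
    """
    click_positions = [i for i, d in enumerate(ranked_docs) if d in clicked]
    if not click_positions:
        # No click means no preference signal under this assumption.
        return [], [], []
    last_click = max(click_positions)
    examined = ranked_docs[: last_click + 1]
    clicked_docs = [d for d in examined if d in clicked]
    non_clicked = [d for d in examined if d not in clicked]
    non_examined = list(ranked_docs[last_click + 1:])
    return clicked_docs, non_clicked, non_examined


def clicked_over_non_examined(
    ranked_docs: List[str], clicked: Set[str]
) -> List[Pair]:
    """Pair every clicked document with every non-examined document."""
    clicked_docs, _, non_examined = split_impression(ranked_docs, clicked)
    return [(c, n) for c in clicked_docs for n in non_examined]


def clicked_over_non_clicked(
    ranked_docs: List[str], clicked: Set[str]
) -> List[Pair]:
    """Pair every clicked document with every examined, non-clicked one."""
    clicked_docs, non_clicked, _ = split_impression(ranked_docs, clicked)
    return [(c, n) for c in clicked_docs for n in non_clicked]


# Toy impression: five results for one query, with clicks on d1 and d3.
ranked = ["d1", "d2", "d3", "d4", "d5"]
clicks = {"d1", "d3"}
pairs_ne = clicked_over_non_examined(ranked, clicks)
pairs_nc = clicked_over_non_clicked(ranked, clicks)
```

In this toy impression, d2 is treated as examined but skipped, while d4 and d5 fall below the last click and are treated as non-examined. Each resulting pair would then serve as one (preferred, less preferred) training example for the query.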

Paper Structure (10 sections, 11 equations, 2 figures, 1 table)

This paper contains 10 sections, 11 equations, 2 figures, 1 table.

Introduction
Related Work
Semantic Embedding Model for Web Search
Architecture of SEM for Web Search
Optimization
Experimental Setup
Atomic Strategies
Hybrid Strategy
Discussions
Conclusions

Figures (2)

Figure 1: Semantic Embedding Model Architecture
Figure 2: Performance of Pairwise Judgments Formed by Various Strategies for Training Semantic Embedding Models
