Introducing Semantic Capability in LinkedIn's Content Search Engine

Xin Yang, Rachel Zheng, Madhumitha Mohan, Sonali Bhadra, Pansul Bhatt, Lingyu Zhang, Rupesh Gupta

TL;DR

The paper addresses the challenge of increasingly long, natural-language search queries by introducing semantic capability into LinkedIn's content search engine. It presents a two-layer architecture: a retrieval layer (token-based and embedding-based) followed by a two-stage ranking layer (L1/L2). Embedding-based retrieval (EBR) uses a two-tower model based on multilingual-e5, with precomputed post embeddings and approximate nearest neighbor search. Ranking is guided by a weighted score combining on-topicness and long-dwell, $\text{score} = \alpha \cdot \text{on-topicness} + (1-\alpha) \cdot \text{long-dwell}$, with $\alpha$ tuned online. The approach improves both on-topic rate and long-dwell by over 10% and has a positive impact on sitewide sessions, demonstrating practical benefits for the real-world search experience. Future work aims to refine quality metrics and to integrate an LLM into ranking for deeper language understanding and relevance.
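The weighted score above can be illustrated with a minimal Python sketch. The `Candidate` class, the field names, and the example values are hypothetical and chosen only for illustration; the summary does not say how on-topicness and long-dwell are predicted, nor what value of α was selected online.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A post with two hypothetical model outputs, each assumed to lie in [0, 1]."""
    post_id: str
    on_topicness: float
    long_dwell: float


def blended_score(on_topicness: float, long_dwell: float, alpha: float) -> float:
    """score = alpha * on-topicness + (1 - alpha) * long-dwell."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * on_topicness + (1.0 - alpha) * long_dwell


def rank(candidates: list[Candidate], alpha: float) -> list[Candidate]:
    """Sort candidates by blended score, highest first."""
    return sorted(
        candidates,
        key=lambda c: blended_score(c.on_topicness, c.long_dwell, alpha),
        reverse=True,
    )
```

A larger α favors posts predicted to be on-topic, while a smaller α favors posts predicted to hold the reader's attention, which is why the weight is a natural quantity to tune in online experiments.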

Abstract

In the past, most search queries issued to a search engine were short and simple. A keyword-based search engine was able to answer such queries quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires a search engine to evolve to have semantic capability. In this paper, we present the design of LinkedIn's new content search engine with semantic capability, and its impact on metrics.

Paper Structure (9 sections, 1 equation, 4 figures)

Figures (4)

  • Figure 1: High-level design of the content search engine consisting of a retrieval layer and a multi-stage ranking layer.
  • Figure 2: Architecture of the two-tower model used in EBR.
  • Figure 3: Approximate nearest neighbor search in EBR using precomputed post embeddings (green) and real-time computed query embedding (pink).
  • Figure 4: Architecture of the models used in L1 and L2 ranking stages.
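The retrieval flow described in the caption of Figure 3, with post embeddings computed ahead of time and the query embedding computed at request time, can be sketched as follows. This is a toy illustration under stated assumptions: it performs an exact brute-force cosine-similarity search as a stand-in for the approximate nearest neighbor search used in production, the function names are hypothetical, and the vectors would in practice come from the two-tower model rather than being supplied by hand.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(post_embeddings: dict[str, list[float]]) -> dict[str, list[float]]:
    """Offline step: normalize and store one embedding per post."""
    return {post_id: normalize(vec) for post_id, vec in post_embeddings.items()}


def top_k(
    index: dict[str, list[float]], query_embedding: list[float], k: int
) -> list[tuple[str, float]]:
    """Online step: score every post against the query and keep the k best.

    Exact search over all posts; a real system would replace this loop
    with an approximate nearest neighbor index.
    """
    q = normalize(query_embedding)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), post_id)
        for post_id, vec in index.items()
    )
    return [(post_id, score) for score, post_id in heapq.nlargest(k, scored)]
```

Normalizing the post vectors once, at indexing time, keeps the per-query work to a single normalization plus dot products, which mirrors the split between precomputed and real-time work in the figure.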