Introducing Semantic Capability in LinkedIn's Content Search Engine
Xin Yang, Rachel Zheng, Madhumitha Mohan, Sonali Bhadra, Pansul Bhatt, Lingyu Zhang, Rupesh Gupta
TL;DR
The paper addresses the challenge of increasingly long, natural-language search queries by introducing semantic capability into LinkedIn's content search engine. It presents a two-layer architecture: a retrieval layer that combines token-based and embedding-based retrieval, and a two-stage ranking layer (L1/L2). The design leverages a two-tower embedding model based on multilingual-e5, along with precomputed post embeddings and approximate nearest neighbor search. A weighted score combining on-topicness and long-dwell, $\text{score} = \alpha \cdot \text{on-topicness} + (1-\alpha) \cdot \text{long-dwell}$, guides ranking, with $\alpha$ tuned online. The approach improves both on-topic rate and long-dwell by over 10% and positively impacts sitewide sessions, demonstrating practical benefits for the real-world search experience. Future work aims to refine quality metrics and integrate an LLM into ranking for deeper language understanding and relevance.
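As a minimal sketch of how the weighted score above could order candidate posts, the Python snippet below blends two per-post predictions with a single weight. The names `Candidate`, `blended_score`, and `rank` are hypothetical, and treating on-topicness and long-dwell as probabilities in [0, 1] is an assumption made for illustration; the summary does not specify how these signals are scaled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate post with two predicted signals (hypothetical structure)."""
    post_id: str
    on_topicness: float  # assumed: predicted probability the post is on-topic
    long_dwell: float    # assumed: predicted probability of a long dwell


def blended_score(candidate: Candidate, alpha: float) -> float:
    """Compute alpha * on-topicness + (1 - alpha) * long-dwell."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * candidate.on_topicness + (1.0 - alpha) * candidate.long_dwell


def rank(candidates: list[Candidate], alpha: float) -> list[Candidate]:
    """Return candidates sorted by blended score, highest first."""
    return sorted(candidates, key=lambda c: blended_score(c, alpha), reverse=True)


if __name__ == "__main__":
    posts = [
        Candidate("a", on_topicness=0.9, long_dwell=0.2),
        Candidate("b", on_topicness=0.4, long_dwell=0.8),
    ]
    # A larger alpha favors on-topic posts; a smaller alpha favors long dwell.
    for alpha in (0.8, 0.2):
        print(alpha, [c.post_id for c in rank(posts, alpha)])
```

Under these assumptions, changing $\alpha$ alone can flip the order of two posts, which is why the weight is a natural candidate for online tuning.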
Abstract
In the past, most search queries issued to a search engine were short and simple, and a keyword-based search engine was able to answer them quite well. However, members are now developing the habit of issuing long and complex natural language queries. Answering such queries requires a search engine to evolve to have semantic capability. In this paper, we present the design of LinkedIn's new content search engine with semantic capability, and its impact on metrics.
