LiNR: Model Based Neural Retrieval on GPUs at LinkedIn

Fedor Borisyuk; Qingquan Song; Mingzhou Zhou; Ganesh Parameswaran; Madhu Arun; Siva Popuri; Tugrul Bingol; Zhuotao Pei; Kuang-Hsuan Lee; Lu Zheng; Qizhan Shao; Ali Naqvi; Sen Zhou; Aman Gupta

TL;DR

LiNR is a GPU-first, live-updated, model-based retrieval system for LinkedIn's embedding-based recommendations. It enables near-real-time, differentiable retrieval and joint optimization of retrieval and ranking. The approach combines exhaustive search with attribute-based matching (ABM), 1-bit Sign-OPORP quantization, and a Mixture-of-Logits framework with clustering and residual IDs to handle cold starts and multi-embedding signals. Offline and online experiments show performance within latency requirements, improved engagement metrics, and robust live-update behavior, with a path toward unifying retrieval and ranking in a single GPU model. The work also shares practical deployment lessons, including custom CUDA pre-filtering kernels and nearline data ingestion, that make model-based retrieval viable at industrial scale.

Abstract

This paper introduces LiNR, LinkedIn's large-scale, GPU-based retrieval system. LiNR supports a billion-sized index on GPU models. We discuss our experiences and challenges in creating scalable, differentiable search indexes using TensorFlow and PyTorch at production scale. In LiNR, both items and model weights are integrated into the model binary. Viewing index construction as a form of model training, we describe scaling our system for large indexes, incorporating full scans and efficient filtering. A key focus is on enabling attribute-based pre-filtering for exhaustive GPU searches, addressing the common challenge of post-filtering in KNN searches that often reduces system quality. We further provide multi-embedding retrieval algorithms and strategies for tackling cold-start issues in retrieval. Our advancements in supporting larger indexes through quantization are also discussed. We believe LiNR represents one of the industry's first live-updated model-based retrieval indexes. Applied to out-of-network post recommendations on LinkedIn Feed, LiNR has contributed to a 3% relative increase in professional daily active users. We envisage LiNR as a step towards integrating retrieval and ranking into a single GPU model, simplifying complex infrastructures and enabling end-to-end optimization of the entire differentiable infrastructure through gradient descent.
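The difference between post-filtering and pre-filtering mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not LiNR's implementation, which runs on GPU; the function names, the toy items, and the single attribute per item are assumptions made only for illustration.

```python
def dot(a, b):
    """Inner-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scored, k):
    """Return the k highest-scoring (item_id, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def post_filter_knn(query, items, allowed, k):
    """Select top-k over ALL items, then drop those failing the attribute check."""
    scored = [(i, dot(query, emb)) for i, (emb, _attr) in enumerate(items)]
    return [(i, s) for i, s in top_k(scored, k) if items[i][1] in allowed]


def pre_filter_knn(query, items, allowed, k):
    """Apply the attribute check first, then select top-k among feasible items."""
    scored = [
        (i, dot(query, emb))
        for i, (emb, attr) in enumerate(items)
        if attr in allowed
    ]
    return top_k(scored, k)


# Hypothetical toy index: (embedding, attribute) pairs.
items = [
    ([0.9, 0.1], "fr"),
    ([0.8, 0.2], "fr"),
    ([0.5, 0.5], "en"),
    ([0.1, 0.9], "en"),
]
query = [1.0, 0.0]

post = post_filter_knn(query, items, {"en"}, k=2)
pre = pre_filter_knn(query, items, {"en"}, k=2)
print("post-filter:", post)
print("pre-filter: ", pre)
```

In this toy case the two most similar items both fail the attribute check, so post-filtering returns nothing, while pre-filtering still returns the two feasible items. An exhaustive scan makes pre-filtering straightforward because every item is visited anyway.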

Paper Structure (28 sections, 7 figures, 5 tables)

Introduction
Related Work
Modeling Technology
Exhaustive Search with Attribute-Based Matching (ABM)
KNN with Similarity Masking
KNN with Explicit Pre-Filtering
Quantized KNN
Similarity Modeling
Hadamard MLP
Mixture-of-Logits with Clustering
System Architecture
Out-of-Network Recommendations
ML Infra Architecture
Retriever
Ingestor
...and 13 more sections

Figures (7)

Figure 1: KNN with Similarity Masking. An example with five items and a single query is used for illustration. Item similarities are computed and masked with two 0-1 vectors returned by the two clause checks. For each item, a clause check passes (returns one in the masking matrix) as long as one attribute matches the query attribute. The second clause is a reverse matching clause. Top-1 selection is used in this example. D is the dimension of the item embedding.
Figure 2: KNN with Explicit Pre-Filtering. Clauses are checked one by one and a joint 0-1 mask vector is returned to retrieve the feasible items for matrix multiplication and top-K selection (K=1 here).
Figure 3: KNN with Quantized Filtering reduces the number of retrieved items before the full-precision similarity computation. Bit-wise matching measures the approximate similarity between 1-bit quantized embeddings obtained via the Sign-OPORP method. We precompute an integer bit-wise NOT of either the query or the item embedding and then apply a bit-wise XOR to measure the number of matched bits in the packed integer vector. The quantized KNN module can be used without full-precision matrix multiplication when K is large in top-K selection.
Figure 4: Illustration of Hadamard MLP (left) and learning cluster ID embeddings for Mixture-of-Logits (right).
Figure 5: Feed OON Architecture.
...and 2 more figures
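
The bit-wise matching described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it uses plain sign quantization and omits the permutation and random-projection steps of Sign-OPORP, it packs bits into a single Python integer rather than a GPU integer vector, and the helper names `sign_pack`, `invert`, and `matched_bits` are hypothetical.

```python
def sign_pack(vec):
    """Pack sign bits into one integer: bit i is 1 when vec[i] >= 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def invert(bits, dim):
    """Bit-wise NOT restricted to the lowest `dim` bits."""
    return ~bits & ((1 << dim) - 1)


def matched_bits(inverted_query_bits, item_bits):
    """XOR against the pre-inverted query sets a 1 wherever the original bits agree."""
    return bin(inverted_query_bits ^ item_bits).count("1")


dim = 4
query = [0.3, -1.2, 0.7, -0.1]
items = [
    [0.5, -0.4, 0.9, -2.0],   # same sign pattern as the query
    [0.5, 0.4, 0.9, -2.0],    # one sign differs
    [-0.5, 0.4, -0.9, 2.0],   # every sign differs
]

# The NOT is applied once to the query, ahead of the scan over items.
inverted_query = invert(sign_pack(query), dim)
counts = [matched_bits(inverted_query, sign_pack(item)) for item in items]

# Keep the two items with the most matched bits as the shortlist
# for a later full-precision similarity computation.
shortlist = sorted(range(len(items)), key=lambda i: counts[i], reverse=True)[:2]
print(counts, shortlist)
```

Inverting one side in advance means each comparison needs only an XOR and a population count, which is why the NOT is applied ahead of time instead of per item.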
