Efficient user history modeling with amortized inference for deep learning recommendation models

Lars Hertel, Neil Daftary, Fedor Borisyuk, Aman Gupta, Rahul Mazumder

TL;DR

This paper tackles the latency challenge of Transformer-based user history encoders in deep learning recommendation models (DLRM). It systematically compares two early-fusion strategies, appending the candidate to the history versus concatenating the candidate to each history item, and introduces an amortized inference variant that uses cross-attention when appending: $Q=W_q [H_1, \ldots, H_n, C]$, $K=W_k [H_1, \ldots, H_n]$, and $V=W_v [H_1, \ldots, H_n]$, so queries are computed from the history and the candidate while keys and values come from the history only. The authors show that appending with cross-attention yields predictive performance comparable to concatenation on both public and internal datasets, while amortization sharply reduces inference cost: for $n$ history items and $m$ candidates per request, complexity changes from $O(l m n d^2 + l m n^2 d)$ to $O(l (n+m) d^2 + l (n+m)^2 d)$. Deployment on LinkedIn Feed and Ads demonstrates substantial latency reductions (and associated CPU savings) with amortized inference, along with engagement gains on Feed, underscoring practical impact for real-world ranking systems.
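The cross-attention equations in the TL;DR can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, a single layer, row-vector inputs (so projections are written `x @ W` rather than `W x`), no causal mask over the history, and hypothetical names such as `append_cross_attention`. It only shows why several candidates can share one forward pass: because keys and values come from the history alone, the output for each candidate does not depend on the other candidates in the sequence.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def append_cross_attention(history, candidates, w_q, w_k, w_v):
    """Single-head attention with candidates appended to the history.

    history:    (n, d) array of history item embeddings H_1..H_n
    candidates: (m, d) array of candidate embeddings (m = 1 gives the
                single-candidate case from the TL;DR equations)
    Returns an (n + m, d_v) array; rows n..n+m-1 belong to the candidates.
    """
    seq = np.concatenate([history, candidates], axis=0)  # (n + m, d)
    q = seq @ w_q        # queries: history and candidates
    k = history @ w_k    # keys: history only
    v = history @ w_v    # values: history only
    scores = q @ k.T / np.sqrt(k.shape[-1])              # (n + m, n)
    return softmax(scores, axis=-1) @ v


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n, m, d = 5, 3, 4
H = rng.normal(size=(n, d))
C = rng.normal(size=(m, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# One amortized pass over all m candidates.
batched = append_cross_attention(H, C, W_q, W_k, W_v)

# The same candidate scored on its own gives the same output row.
single = append_cross_attention(H, C[1:2], W_q, W_k, W_v)
print(np.allclose(batched[n + 1], single[n]))
```

In this sketch the history rows of the output are also identical in both calls, which is what lets the history computation be shared across candidates.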

Abstract

We study user history modeling via Transformer encoders in deep learning recommendation models (DLRM). Such architectures can significantly improve recommendation quality, but usually incur high latency cost, necessitating infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and various methods have been studied. We revisit early fusion and compare concatenation of the candidate to each history item against appending it to the end of the list as a separate item. Using the latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON \cite{zhai2024actions} for the case of DLRM models. We show via experimental results that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude with results from deploying this model on the LinkedIn Feed and Ads surfaces, where amortization reduces latency by 30% compared to non-amortized inference.

Paper Structure

This paper contains 11 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Illustration of regular inference (left) for user action history architectures vs. amortized inference (right). History items are shown in grey and candidate items in red. In amortized inference, candidate items are added to the sequence, so the Transformer processes only one sample per request instead of $m$ samples.
  • Figure 2: Attention activations from the first Transformer layer for early fusion by concatenation (top) and by appending with cross-attention (bottom) for a positive Feed training example.
  • Figure 3: Inference time for 100 forward passes using regular vs. amortized inference on CPU and GPU.
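
To make the difference between regular and amortized inference more concrete, the two complexity expressions from the TL;DR can be evaluated directly. The sketch below is only arithmetic on those asymptotic terms with constants dropped, so the ratio is indicative and is not a latency prediction. It assumes that $l$ is the number of Transformer layers and $d$ the embedding dimension (the summary does not define them), with $n$ history items and $m$ candidates; the function names and the example sizes are hypothetical.

```python
def regular_cost(l, m, n, d):
    # O(l m n d^2 + l m n^2 d): the history is re-encoded for each of the m candidates.
    return l * m * n * d**2 + l * m * n**2 * d


def amortized_cost(l, m, n, d):
    # O(l (n+m) d^2 + l (n+m)^2 d): one sequence of n history items plus m candidates.
    return l * (n + m) * d**2 + l * (n + m) ** 2 * d


# Hypothetical sizes, chosen only for illustration.
l, m, n, d = 2, 100, 50, 64
reg = regular_cost(l, m, n, d)
amo = amortized_cost(l, m, n, d)
print(reg, amo, reg / amo)
```

With these illustrative sizes the regular term count is 72,960,000 against 4,108,800 for the amortized form. Measured speedups, such as those in Figure 3 or the 30% latency reduction reported for deployment, depend on hardware and on the rest of the model, which these expressions do not capture.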