Fine-grained large-scale content recommendations for MSX sellers

Manpreet Singh; Ravdeep Pasricha; Ravi Prasad Kondapalli; Kiran R; Nitish Singh; Akshita Agarwalla; Manoj R; Manish Prabhakar; Laurent Boué

TL;DR

This work tackles the challenge of surfacing relevant content for each MSX opportunity by formulating a large-scale semantic matching pipeline that links opportunity context with Seismic metadata. It employs a two-stage retrieval architecture (bi-encoder candidate retrieval followed by cross-encoder re-ranking) with metadata-driven prompts and weekly content updates, designed to scale to ~$7\times 10^5$ opportunities and ~$4\times 10^4$ documents. The approach is evaluated through human expert judgments and an LLM-based proxy. Cross-encoder scores align well with expert ratings (Pearson $r=0.78$, Spearman $\rho=0.64$), and GPT-4 used as a judge correlates positively with the experts ($r=0.42$, $\rho=0.57$), which suggests it is a feasible proxy. Pandas UDFs on Azure Databricks reduce the runtime from ≈$2\ \mathrm{s}$ to ≈$90\ \mathrm{ms}$ per opportunity on a $96$-vcore cluster. Integrated into MSX Copilot, the system provides sellers with top-5 customer-ready or private-content recommendations, enabling more targeted engagement and faster deal velocity; future work includes personalization and multi-modal content handling to further boost relevance and impact.
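The two-stage control flow described above can be sketched as follows. This is only an illustration of retrieve-then-re-rank, not the authors' implementation: `embed` and `pair_score` are toy stand-ins (a bag-of-words vector and a token-overlap score) for the bi-encoder and the cross-encoder, and the document identifiers and prompts are hypothetical. The defaults of 50 candidates and 5 final results follow the figure captions.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for the bi-encoder: an L2-normalised bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def cosine(u, v):
    """Dot product of two normalised sparse vectors (equals cosine similarity)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def pair_score(query, doc):
    """Toy stand-in for the cross-encoder: scores the (query, doc) pair jointly.

    Here it is simply the fraction of query tokens that occur in the document.
    """
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def recommend(opportunity_prompt, content_prompts, doc_vectors,
              n_candidates=50, top_k=5):
    # Stage 1: cheap similarity search against precomputed content embeddings.
    q_vec = embed(opportunity_prompt)
    ranked = sorted(content_prompts,
                    key=lambda d: cosine(q_vec, doc_vectors[d]),
                    reverse=True)
    candidates = ranked[:n_candidates]
    # Stage 2: the more expensive pairwise scorer sees only the candidates.
    reranked = sorted(candidates,
                      key=lambda d: pair_score(opportunity_prompt,
                                               content_prompts[d]),
                      reverse=True)
    return reranked[:top_k]


# Hypothetical content prompts, embedded once ahead of time.
content_prompts = {
    "doc_azure": "azure migration customer success story for retail",
    "doc_teams": "teams adoption guide for frontline workers",
    "doc_security": "security comparison with competitor products",
    "doc_azure_tech": "azure migration technical documentation",
}
doc_vectors = {doc_id: embed(text) for doc_id, text in content_prompts.items()}

top = recommend("retail customer azure migration", content_prompts,
                doc_vectors, n_candidates=3, top_k=2)
print(top)
```

The point of the split is that document embeddings can be computed once and reused for every opportunity, while the pairwise scorer, which must read the opportunity and the document together, is only run on the short candidate list.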

Abstract

One of the most critical tasks of Microsoft sellers is to meticulously track and nurture potential business opportunities through proactive engagement and tailored solutions. Recommender systems play a central role in helping sellers achieve their goals. In this paper, we present a content recommendation model which surfaces various types of content (technical documentation, comparisons with competitor products, customer success stories, etc.) that sellers can share with their customers or use for their own self-learning. The model operates at the opportunity level, which is the lowest possible granularity and the most relevant one for sellers. It is based on semantic matching between metadata from the contents and carefully selected attributes of the opportunities. Considering the volume of seller-managed opportunities in organizations such as Microsoft, we show how to perform efficient semantic matching over a very large number of opportunity-content combinations. The main challenge is to ensure that the top-5 relevant contents for each opportunity are recommended out of a total of $\approx 40{,}000$ published contents. We achieve this target through an extensive comparison of different model architectures and feature selection. Finally, we examine the quality of the recommendations quantitatively, using both human domain experts and the recently proposed "LLM as a judge" framework.
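To give a sense of the "very large number of opportunity-content combinations", the short calculation below uses the approximate figures quoted in the TL;DR and figure captions (about $7\times 10^5$ opportunities, about $4\times 10^4$ documents, 50 candidates per opportunity). The variable names are illustrative, and the totals assume every opportunity is scored in a single run, which may overstate what the pipeline processes in any one refresh.

```python
# Approximate figures taken from the TL;DR and figure captions.
n_opportunities = 700_000   # ~7 x 10^5 opportunities
n_documents = 40_000        # ~4 x 10^4 published contents
n_candidates = 50           # candidates kept after the first stage

# Scoring every opportunity against every document with the pairwise model.
all_pairs = n_opportunities * n_documents

# Pairs the cross-encoder actually sees when it only re-ranks the candidates.
cross_encoder_pairs = n_opportunities * n_candidates

# How much smaller the re-ranking workload is than exhaustive pairwise scoring.
reduction = all_pairs // cross_encoder_pairs

print(f"all pairs:           {all_pairs:,}")
print(f"cross-encoder pairs: {cross_encoder_pairs:,}")
print(f"reduction factor:    {reduction}x")
```

Under these assumptions the cross-encoder workload shrinks by a factor of 800, which is simply the ratio of the corpus size to the candidate list size.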

Paper Structure (14 sections, 1 equation, 6 figures)

Introduction
Large scale semantic matching for content recommendations
Prompt engineering
Model architecture
Orders of magnitude
Run-time performance optimization
Relevance/performance evaluation of the recommendations
Human expert evaluation and cross-encoder scores as a proxy
Ablation study
LLM as a judge
Integration in MSX
Conclusion
Acknowledgements
Appendix on MSX integration

Figures (6)

Figure 1: Top) Seismic documents are summarized into textual descriptions, referred to as "content prompts", based on their metadata. These prompts are then run through a DistilBERT language model [sanh2019distilbert] (pre-trained on the MS MARCO dataset). The resulting embeddings are refreshed on a weekly basis. Bottom) The $\delta$-opportunities (defined in the main part of the text) are gathered from Nebula [nebula], which is an in-house ETL system developed by the SPS team. Next, the opportunity prompt (the attributes of the opportunity summarized into a textual prompt, in a manner similar to content prompts) is run through the same DistilBERT language model and compared to the content embeddings to generate a list of top-50 candidate documents. Those candidates are re-ranked using the MS MARCO MiniLM pre-trained cross-encoder. Finally, the top-5 results for each opportunity are stored in ADLS, from where the Nebula insights pipeline pushes the results to a Cosmos database. The .NET API then pulls the recommended Seismic documents per opportunity from Cosmos and displays them in the UI (see the section on MSX integration).
Figure 2: Illustration of the performance gain by using Pandas UDFs on Azure Databricks Spark clusters. As expected, the processing time grows linearly with the number of opportunities. Further incremental gains may be obtained by increasing the size of the Spark cluster. Note that the number of opportunities is not the same as the number of records processed by the cross-encoder. Consider, for instance, that we have $1{,}000$ opportunities. In that case, the total number of records processed by the cross-encoder is $50 \times 1{,}000 = 50{,}000$, where the factor of 50 corresponds to the number of candidates retrieved in the first stage before re-ranking.
Figure 3: Illustration of the good alignment between cross-encoder scores and human judgment with a Pearson correlation coefficient of 0.78. Going beyond linear correlations, we have also estimated a rank-based Spearman's coefficient of 0.64.
Figure 4: Cross-encoder scores (as returned by the model) vs. each one of the 22 evaluation prompts. The dashed vertical lines represent the different groups A, B, C, and D. As explained in the main part of the text, the low performance of group B indicates the importance of "sales play" as a critical feature.
Figure 5: Correlation between average human expert scores and scores as judged by GPT-4 for the same 22 evaluation queries. The red line corresponds to a linear fit with a Pearson correlation coefficient of 0.42, confirming the positive correlation between the two sets of scores. We have also estimated a Spearman's coefficient of 0.57.
...and 1 more figure
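
The captions for Figures 3 and 5 report both a Pearson coefficient (linear correlation) and a Spearman coefficient (rank correlation). As a minimal sketch of how the two differ, the code below computes both from scratch; the score lists are hypothetical and are not the paper's evaluation data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five queries (not the paper's data).
model_scores = [0.10, 0.40, 0.50, 0.90, 0.95]
human_ratings = [1, 2, 2, 4, 5]

print(f"Pearson r    = {pearson(model_scores, human_ratings):.3f}")
print(f"Spearman rho = {spearman(model_scores, human_ratings):.3f}")
```

Spearman's coefficient depends only on the ordering of the scores, so it remains 1 for any strictly increasing relationship even when that relationship is not linear, whereas Pearson's coefficient drops below 1 in that case.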
