HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan, Kaiming Gao, Li Wang, Shen Jiang, Peng Jiang
TL;DR
HCMRM tackles the gap between vision-language pre-training and downstream query-video relevance in short video ads by introducing pseudo-query-video matching during pre-training and a symmetric hierarchical softmax loss for fine-tuning. Built on ALBEF, it reuses mainstream multimodal encoders while converting video text into a keyword sequence, which enables efficient modeling of the triplet $(Q, I, T)$ of query, visual signals, and video text. The approach improves both offline ranking and online advertising performance, including a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue in production, and demonstrates the practicality of pseudo-queries for aligning pre-training with relevance tasks. It also shows that the hierarchical softmax loss improves ranking without requiring heavy architectural changes, and that domain-tuned models can outperform general large multimodal language models on short-video relevance tasks.
Abstract
Search advertising is essential for merchants to reach target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, because aligning the pair of visual signals and text differs from modeling the triplet of query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
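The pseudo-query idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the frequency-based keyword ranking, the stopword list, and the names `extract_keywords`, `make_pseudo_query`, and `build_triplet` are all assumptions made for illustration. The sketch only shows how a pre-training sample with a pseudo-query Q, a visual input I, and a keyword-sequence video text T might be assembled from a video's own text.

```python
import random
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "for", "of", "in", "on", "to", "with", "is"}


def extract_keywords(video_text, top_k=5):
    """Return up to top_k keywords from the video text.

    Assumption: keywords are ranked by raw frequency, with ties broken by
    first appearance. The paper's actual keyword extractor may differ.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]+", video_text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    first_pos = {}
    for i, tok in enumerate(tokens):
        first_pos.setdefault(tok, i)
    ranked = sorted(counts, key=lambda t: (-counts[t], first_pos[t]))
    return ranked[:top_k]


def make_pseudo_query(keywords, num_terms=2, rng=None):
    """Sample a few keywords and join them into a pseudo-query string."""
    rng = rng or random.Random(0)
    k = min(num_terms, len(keywords))
    picked = set(rng.sample(keywords, k))
    # Keep the ranked keyword order so the pseudo-query is deterministic
    # given the sampled set.
    return " ".join(w for w in keywords if w in picked)


def build_triplet(image_id, video_text, rng=None):
    """Assemble a (Q, I, T) sample; T is the keyword sequence of the video text."""
    keywords = extract_keywords(video_text)
    return {
        "Q": make_pseudo_query(keywords, rng=rng),  # pseudo-query
        "I": image_id,                              # placeholder for visual signals
        "T": " ".join(keywords),                    # video text as keyword sequence
    }


if __name__ == "__main__":
    sample = build_triplet("frame_0", "Running shoes for men, lightweight running shoes on sale")
    print(sample)
```

In this sketch the pseudo-query is always a subset of the video's keywords, so every generated triplet is a positive example; negatives for triplet relevance modeling would have to come from elsewhere, for instance by pairing a pseudo-query with another video in the batch, which is again an assumption rather than something stated in the abstract.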
