HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads

Guobing Gan, Kaiming Gao, Li Wang, Shen Jiang, Peng Jiang

TL;DR

HCMRM tackles the gap between vision–language pre-training and downstream query–video relevance in short video ads by introducing pseudo-query–video matching during pre-training and a symmetric hierarchical softmax loss for fine-tuning. Built on ALBEF, it reuses mainstream multimodal encoders while converting video text into a keyword sequence to enable efficient modeling of the triplet $(Q, I, T)$ of query, visual signals, and video text. The approach improves offline ranking and online advertising performance, including a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue in production, and demonstrates the practicality of pseudo-queries for aligning pre-training with relevance tasks. It also shows that hierarchical softmax improves ranking without requiring heavy architectural changes, and that domain-tuned models can outperform general large multimodal language models on short video relevance tasks.

Abstract

Search advertising is essential for merchants to reach target users on short video platforms. Short video ads aligned with user search intents are displayed through relevance matching and bid ranking mechanisms. This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. Recent vision-language pre-training models have demonstrated promise in various multimodal tasks. However, their contribution to downstream query-video relevance tasks is limited, as the alignment between the pair of visual signals and text differs from the modeling of the triplet of the query, visual signals, and video text. In addition, our previous relevance model provides limited ranking capabilities, largely due to the discrepancy between the binary cross-entropy fine-tuning objective and the ranking objective. To address these limitations, we design a high-consistency multimodal relevance model (HCMRM). It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. Specifically, during the pre-training phase, along with aligning visual signals and video text, several keywords are extracted from the video text as pseudo-queries to perform triplet relevance modeling. For the fine-tuning phase, we introduce a hierarchical softmax loss, which enables the model to learn the order within labels while maximizing the distinction between positive and negative samples. This promotes the fusion ranking of relevance and bidding in the subsequent ranking stage. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
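
As a rough illustration of the pseudo-query idea described in the abstract, the sketch below draws a few keywords from a video's keyword list and joins them into a query-like string. The sampling rule (inverse-rank weighting), the default of two keywords, and the function name `make_pseudo_query` are assumptions made for this example; the abstract states only that several keywords are extracted from the video text as pseudo-queries.

```python
import random


def make_pseudo_query(keywords, k=2, rng=None):
    """Build a pseudo-query from a keyword list ordered by importance.

    Assumption (not from the paper): keywords are sampled without
    replacement, with weight 1 / (rank + 1), so that more important
    keywords are picked more often.
    """
    rng = rng or random.Random()
    k = min(k, len(keywords))
    pool = list(range(len(keywords)))  # candidate keyword positions
    chosen = []
    while len(chosen) < k:
        weights = [1.0 / (i + 1) for i in pool]
        idx = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(idx)
        pool.remove(idx)
    chosen.sort()  # keep the original importance order
    return " ".join(keywords[i] for i in chosen)


# Hypothetical keyword sequence for one short video ad.
video_keywords = ["sneakers", "running", "lightweight", "discount", "men"]
print(make_pseudo_query(video_keywords, k=2, rng=random.Random(0)))
```

Pairing such a pseudo-query with its own video would give a positive example for triplet matching, and pairing it with a different video would give a negative one. How negatives are actually chosen is not specified in this summary.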

Paper Structure

This paper contains 27 sections, 10 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: An example of a short video ad, including both video signals and text information. (A) is a diagram of the video image processing procedure: uniformly sampled frames are blocked and spliced before being input into the model. The upper part of (B) shows that the video text consists of keywords extracted from multiple text fields in the video, arranged in descending order of importance. The lower part of (B) illustrates the process of generating pseudo-queries during pre-training.
  • Figure 2: Overview of HCMRM. It is built on ALBEF with minimal modifications and pre-trained using four objectives: Image-Text Contrastive Learning (ITC), Image-Text Matching (ITM), Masked Language Modeling (MLM), and Pseudo-Query-Video Matching (PQVM). Note that the downstream relevance task between a query and a short video ad is consistent with PQVM.
  • Figure 3: The hierarchical binary structure within the labels.
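
One way to read a hierarchical binary structure over ordered labels is as a small decision tree: a root decision separates negative from positive grades, and a second decision refines the grade within each side. The sketch below is a minimal illustration under assumptions that are not taken from the paper: four hypothetical relevance grades (0 to 3), a two-level tree, and three scalar logits named `z_root`, `z_neg`, and `z_pos`. It is not the authors' loss.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def label_probs(z_root, z_neg, z_pos):
    """Probabilities of four hypothetical grades from three binary decisions.

    z_root: logit for "positive side" (grades 2, 3) vs "negative side" (0, 1)
    z_neg:  logit for grade 1 vs grade 0, given the negative side
    z_pos:  logit for grade 3 vs grade 2, given the positive side
    """
    p_pos = sigmoid(z_root)
    p_neg_hi = sigmoid(z_neg)
    p_pos_hi = sigmoid(z_pos)
    return [
        (1.0 - p_pos) * (1.0 - p_neg_hi),  # grade 0
        (1.0 - p_pos) * p_neg_hi,          # grade 1
        p_pos * (1.0 - p_pos_hi),          # grade 2
        p_pos * p_pos_hi,                  # grade 3
    ]


def hierarchical_loss(z_root, z_neg, z_pos, label):
    """Negative log-likelihood of the labelled grade under the tree."""
    return -math.log(label_probs(z_root, z_neg, z_pos)[label])


def positive_score(z_root):
    """Probability of the positive side, usable as a relevance score."""
    return sigmoid(z_root)


probs = label_probs(z_root=1.2, z_neg=-0.3, z_pos=0.8)
print(probs, sum(probs))
print(hierarchical_loss(1.2, -0.3, 0.8, label=3))
```

In this reading, the root decision carries the positive-versus-negative distinction, while the lower decisions encode the order within each side. Both properties are attributed to the hierarchical softmax loss in the abstract, but the exact tree and label set used by HCMRM are not given in this summary.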