Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling

Haoran Li, Zhiming Su, Junyan Yao, Enwei Zhang, Yang Ji, Yan Chen, Kan Zhou, Chao Feng, Jiao Ran

TL;DR

This work tackles the gap between synthetic data and domain-specific, fine-grained relevance needs in short-video search. It introduces a 4-level relevance dataset and a semi-supervised SSRA pipeline that generates controllable, relevance-labeled data by combining score-based re-annotation and iterative refinement. Offline experiments on dual-size backbones show SSRA consistently improves retrieval and pairwise classification metrics, outperforming prompt-based and vanilla SFT baselines. Online deployment in Douyin demonstrates tangible gains in CTR, SRR, and IUPR, underscoring the practical value of fine-grained relevance supervision in embedding learning.

Abstract

Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline in which two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training dataset. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms models trained on data generated by prompting or by vanilla supervised fine-tuning (SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in the training data enhances the model's sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In online A/B testing within the search-enhanced recommendation pipeline of Douyin's dual-column scenario, the proposed model increased click-through rate (CTR) by 1.45%, raised the Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.

Paper Structure

This paper contains 41 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Overview of the proposed two-stage Semi-Supervised Relevance-Aware data synthesis (SSRA) pipeline. Steps 1–4 on the left (highlighted in blue) correspond to Stage 1, while steps 5–7 on the right (highlighted in green) represent Stage 2. The term D2Q refers to a data structure where each document is associated with multiple queries.
  • Figure 2: Illustration of our score-based re-annotation strategy in Stage 1. $S_i$ represents the target relevance score. Left: In the original labeled dataset, multiple documents are associated with the same high-frequency query, leading to reduced diversity in query generation. Right: After Stage 1 re-annotation, each document is linked to multiple queries with varying relevance labels, as labeled by a tuned score model. This enables the query generation model to synthesize diverse queries for unseen documents during inference.
  • Figure 3: Prompt for short video content rewriting.
  • Figure 4: Prompt for score model relevance reasoning data construction.
  • Figure 5: Prompt for score model training and inference.
  • ...and 2 more figures
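
The D2Q structure and score-based re-annotation described in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `build_d2q` and `toy_score`, the integer coding 0–3 for the four relevance levels, and the word-overlap scorer are all assumptions made for illustration. In the paper the labels come from a tuned score model, for which `toy_score` is only a stand-in.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed integer coding of the four relevance levels (not specified in the source).
RELEVANCE_LEVELS = (0, 1, 2, 3)


def build_d2q(
    docs: List[str],
    candidate_queries: List[str],
    score_fn: Callable[[str, str], int],
) -> Dict[str, List[Tuple[str, int]]]:
    """Re-annotate every (query, document) pair and group the results by document.

    The result maps each document to a list of (query, relevance_label) pairs,
    i.e. a document-to-queries (D2Q) structure.
    """
    d2q: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc in docs:
        for query in candidate_queries:
            label = score_fn(query, doc)
            if label not in RELEVANCE_LEVELS:
                raise ValueError(f"unexpected relevance label: {label}")
            d2q[doc].append((query, label))
    return dict(d2q)


def toy_score(query: str, doc: str) -> int:
    """Hypothetical stand-in for the tuned score model: shared-word count, capped at 3."""
    overlap = len(set(query.split()) & set(doc.split()))
    return min(overlap, 3)
```

Calling `build_d2q(docs, queries, toy_score)` attaches every candidate query to every document with its own label, so a single document ends up paired with queries at several relevance levels rather than only with one high-frequency query.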