Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation

Yizhu Liu; Ran Tao; Shengyu Guo; Yifan Yang

TL;DR

This work tackles the difficulty of modeling topic relevance for long social-search documents under limited labeled data. It introduces mix-structured summarization to jointly capture query-related content and core document information, and leverages LLM-based data augmentation (query rewriting and generation) to expand the training set. Offline results show substantial gains, with the MSD-LLM-CE variant achieving the highest AUC, while online tests confirm improved user satisfaction. The proposed methods are practical for real-world, long-document relevance tasks and require minimal changes to deployment.

Abstract

Topic relevance between a query and a document is a key component of social search: it measures how well a document matches the user's need. In most social search scenarios, such as Dianping, modeling search relevance faces two challenges. First, many documents in social search are very long and contain much redundant information. Second, training data for search relevance models is hard to obtain, especially for multi-class relevance models. To tackle these two problems, we first feed the topic relevance model the query concatenated with both a query-based summary and a query-independent document summary, which helps the model learn the degree of relevance between the query and the core topic of the document. We then use the language understanding and generation abilities of a large language model (LLM) to rewrite existing queries and generate new ones from the queries and documents in the existing training data, constructing new query-document pairs as training data. Extensive offline experiments and online A/B tests show that the proposed approaches effectively improve the performance of relevance modeling.
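The input construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say how either summary is extracted, so the code below makes loud assumptions: the query-based summary is approximated by the sentences with the highest token overlap with the query, the query-independent document summary by the leading sentences, and the parts are joined with a `[SEP]` string. All function names, parameters, and the separator are hypothetical and are not taken from the paper.

```python
import re

SEP = " [SEP] "  # assumed separator; the source does not name one


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def query_focused_summary(query, sentences, k=2):
    """Stand-in extractor (assumption): keep up to k sentences that share
    the most tokens with the query, in their original order."""
    q = tokens(query)
    ranked = sorted(
        enumerate(sentences),
        key=lambda pair: (-len(q & tokens(pair[1])), pair[0]),
    )
    keep = sorted(i for i, s in ranked[:k] if q & tokens(s))
    return " ".join(sentences[i] for i in keep)


def document_summary(sentences, k=1):
    """Stand-in extractor (assumption): the first k sentences, chosen
    without looking at the query."""
    return " ".join(sentences[:k])


def build_model_input(query, document, k_query=2, k_doc=1):
    """Concatenate query, query-based summary, and document summary."""
    sents = split_sentences(document)
    return SEP.join([
        query,
        query_focused_summary(query, sents, k_query),
        document_summary(sents, k_doc),
    ])
```

For example, `build_model_input("spicy hotpot", review_text)` yields one string that a relevance classifier could consume in place of the full document; any real system would swap the two stand-in extractors for the summarizers actually used.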

Paper Structure (18 sections, 3 equations, 3 figures, 4 tables)

Introduction
Approach
Mix-structure Summarization
Query-focused Summarization
Document Summarization
LLM-based Data Augmentation
Query Rewriting
Query Generation
Evaluation
Datasets
Experimental Setup
Evaluation Metrics
Offline Experimental Results
Query-focused Summarization vs. Mix-structured Summarization
With LLM-based Data Augmentation vs. Without LLM-based Data Augmentation
...and 3 more sections

Figures (3)

Figure 1: Illustration of mix-structured summarization.
Figure 2: The process of LLM-based query rewriting. Tuples of (Q, D, label) are new samples.
Figure 3: The process of LLM-based query generation. Tuples of (Q, D, label) are new samples.
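The two augmentation routes named in the figure captions, both of which yield new (Q, D, label) tuples, can be sketched as follows. This is only an illustration under stated assumptions: the LLM is represented by plain callables passed in by the caller (`rewrite` and `generate` are hypothetical, not a real API), a rewritten query is assumed to inherit the label of the original pair, and a generated query is assumed to be requested for a specific target label. This page does not state how the paper assigns labels, so both choices are placeholders.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[str, str, int]  # (query, document, label)


def augment_by_rewriting(
    samples: Iterable[Sample],
    rewrite: Callable[[str], List[str]],  # hypothetical LLM wrapper
) -> List[Sample]:
    """Pair each rewritten query with the original document.
    Assumption: the rewrite keeps the original label."""
    new_samples = []
    for query, doc, label in samples:
        for new_query in rewrite(query):
            if new_query and new_query != query:
                new_samples.append((new_query, doc, label))
    return new_samples


def augment_by_generation(
    samples: Iterable[Sample],
    generate: Callable[[str, int], List[str]],  # hypothetical LLM wrapper
    labels: Iterable[int],
) -> List[Sample]:
    """Ask for new queries per document and target label.
    Assumption: the label is the one requested from the generator."""
    samples = list(samples)
    labels = list(labels)
    seen = {(q, d) for q, d, _ in samples}
    docs = dict.fromkeys(d for _, d, _ in samples)  # unique, order kept
    new_samples = []
    for doc in docs:
        for label in labels:
            for query in generate(doc, label):
                if query and (query, doc) not in seen:
                    seen.add((query, doc))
                    new_samples.append((query, doc, label))
    return new_samples
```

Passing the LLM in as a callable keeps the sketch runnable with a stub and leaves prompt design, which this page does not describe, outside the example.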
