Query-oriented Data Augmentation for Session Search

Haonan Chen, Zhicheng Dou, Yutao Zhu, Ji-Rong Wen

TL;DR

The paper tackles the challenge of modeling session context in session-based document ranking by revealing a symmetry gap: the relevance between a candidate document and a session context can vary with changes to the current query. It introduces QASS, a two-stage approach that augments data by altering the current query at term- and query-levels, producing negative sequences of varying difficulty (easy via random queries, medium via historical/term modifications, hard via ambiguous queries) and training with a BERT-based sequence scorer using pairwise losses and carefully tuned margins. Empirical results on AOL and Tiangong-ST show that QASS outperforms strong baselines, with notable gains from ambiguous-query mining and term-level perturbations, while maintaining comparable online inference cost. The work contributes a novel direction in session-search learning by performing negative sampling on the query side, offering practical improvements for building context-aware retrieval systems and opening avenues for curriculum-like training and broader base-model deployment in future work.
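
The pairwise objective described above can be illustrated with a minimal sketch. It assumes a hinge-style margin loss over two scalar scores, one for the original sequence and one for the augmented (negative) sequence. The function name, the `MARGINS` table, and the margin values are hypothetical placeholders, not the settings reported in the paper; the exact loss form used by QASS may differ.

```python
from typing import Dict

# Hypothetical margins per difficulty level (placeholder values, not from the paper).
# One plausible choice: easier negatives must be separated by a larger score gap.
MARGINS: Dict[str, float] = {"easy": 1.0, "medium": 0.5, "hard": 0.2}


def pairwise_margin_loss(score_orig: float, score_aug: float, difficulty: str) -> float:
    """Hinge loss that is zero once the original sequence outscores the
    augmented sequence by at least the margin for the given difficulty."""
    margin = MARGINS[difficulty]
    return max(0.0, margin - (score_orig - score_aug))


if __name__ == "__main__":
    # The original sequence already beats an easy negative by 1.5, so no loss.
    print(pairwise_margin_loss(2.0, 0.5, "easy"))
    # A hard negative scored almost as high as the original, so a loss remains.
    print(pairwise_margin_loss(1.0, 0.9, "hard"))
```

In a real training loop the two scores would come from the BERT-based sequence scorer, and the loss would be averaged over a batch of (original, augmented) pairs.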

Abstract

Modeling contextual information in a search session has drawn increasing attention as a way to understand complex user intents. Recent methods are all data-driven, i.e., they train different models on large-scale search log data to identify the relevance between search contexts and candidate documents. The common training paradigm is to pair the search context with different candidate documents and train the model to rank the clicked documents higher than the unclicked ones. However, this paradigm neglects the symmetric nature of the relevance between the session context and the document, i.e., the clicked documents can also be paired with different search contexts during training. In this work, we propose query-oriented data augmentation to enrich search logs and strengthen the modeling. We generate supplemental training pairs by altering the most important part of a search context, i.e., the current query, and train our model to rank the generated sequence along with the original sequence. This approach enables models to learn that the relevance of a document may vary as the session context changes, leading to a better understanding of users' search patterns. We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty. Experiments on two large public search logs demonstrate the effectiveness of our model.
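
To make the idea of altering the current query concrete, here is a minimal sketch of three alteration strategies. It assumes a session is a list of history strings followed by the current query and the clicked document. All function names are hypothetical, and term deletion is only one assumed example of a term-level modification; the paper's actual operations may differ.

```python
import random
from typing import List, Sequence


def replace_with_random_query(current: str, pool: Sequence[str], rng: random.Random) -> str:
    """Query-level, easy negative: swap in a random query from the log."""
    candidates = [q for q in pool if q != current]
    return rng.choice(candidates) if candidates else current


def replace_with_history_query(current: str, history: Sequence[str], rng: random.Random) -> str:
    """Query-level, medium negative: reuse an earlier query of the same session."""
    candidates = [q for q in history if q != current]
    return rng.choice(candidates) if candidates else current


def delete_random_term(current: str, rng: random.Random) -> str:
    """Term-level, medium negative (assumed operation): drop one term."""
    terms = current.split()
    if len(terms) < 2:
        return current  # nothing sensible to delete from a one-term query
    drop = rng.randrange(len(terms))
    return " ".join(t for i, t in enumerate(terms) if i != drop)


def build_sequence(history: Sequence[str], query: str, clicked_doc: str) -> List[str]:
    """Concatenate history, (possibly altered) current query, and clicked document."""
    return list(history) + [query, clicked_doc]


if __name__ == "__main__":
    rng = random.Random(0)
    history = ["paris hotels", "paris flights"]
    current = "cheap flights paris"
    original = build_sequence(history, current, "doc_42")
    augmented = build_sequence(history, delete_random_term(current, rng), "doc_42")
    print(original)
    print(augmented)
```

The original and augmented sequences share the same history and clicked document, so any difference in their scores is attributable to the change in the current query.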

Paper Structure

This paper contains 29 sections, 11 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: An illustration of our augmented training pairs. The existing training paradigm constructs training samples by pairing different candidate documents with a fixed search context, whereas we pair a fixed clicked document with both the original search context and a context whose current query has been modified.
  • Figure 2: Illustration of QASS. The current query $q_\text{c}$ of the original user behavior sequence $S$ is altered to construct the augmented sequence $S'$. The clicked document $d_\text{c}$ is hypothesized to be more relevant to the original search context than the altered context (i.e., $P(S)>P(S')$).
  • Figure 3: Illustration of mining the ambiguous queries. We first use a ranking model to obtain a ranking list of all documents for each query. Then a window of negative documents is sampled around each query's clicked document. If $d_\text{c}$ is in the window of a query $q_\text{c}'$, this query $q_\text{c}'$ is an ambiguous query of $q_\text{c}$. The closer $d_\text{c}$ is to the clicked document of $q_\text{c}'$ ($d_\text{c}'$), the more ambiguous $q_\text{c}'$ is to $q_\text{c}$.
  • Figure 4: Influence of the number of generated data pairs and the score margins.
  • Figure 5: Performance across different session lengths on the AOL search log.
  • ...and 1 more figure
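
The ambiguous-query mining procedure described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation. It assumes each query's ranking is available as a list of document IDs, and that the window is measured as a rank distance from that query's clicked document. The function name and the `window` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def mine_ambiguous_queries(
    query: str,
    clicked: Dict[str, str],
    rankings: Dict[str, List[str]],
    window: int,
) -> List[Tuple[str, int]]:
    """Return (other_query, rank_distance) pairs, most ambiguous first.

    Another query q' counts as ambiguous for `query` when the clicked
    document of `query` falls within `window` ranks of the clicked
    document of q' in the ranking list of q'.
    """
    target_doc = clicked[query]
    found: List[Tuple[str, int]] = []
    for other, ranked_docs in rankings.items():
        if other == query:
            continue
        other_clicked = clicked[other]
        # Assumption: the window holds negative documents only, so a query
        # that shares the same clicked document is skipped.
        if other_clicked == target_doc:
            continue
        if target_doc not in ranked_docs or other_clicked not in ranked_docs:
            continue
        distance = abs(ranked_docs.index(target_doc) - ranked_docs.index(other_clicked))
        if distance <= window:
            found.append((other, distance))
    # A smaller distance means the other query is more ambiguous.
    return sorted(found, key=lambda pair: pair[1])


if __name__ == "__main__":
    clicked = {"q1": "d1", "q2": "d2", "q3": "d3"}
    rankings = {
        "q1": ["d1", "d2", "d3"],
        "q2": ["d5", "d1", "d2", "d9"],
        "q3": ["d1", "d7", "d8", "d3"],
    }
    print(mine_ambiguous_queries("q1", clicked, rankings, window=2))
```

Sorting by rank distance mirrors the caption's statement that the closer $d_\text{c}$ is to $d_\text{c}'$, the more ambiguous $q_\text{c}'$ is to $q_\text{c}$.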