ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval

Fengran Mo, Jinghan Zhang, Yuchen Hui, Jia Ao Sun, Zhichao Xu, Zhan Su, Jian-Yun Nie

TL;DR

ConvMix tackles data scarcity in conversational dense retrieval with a mixed-criteria data augmentation framework that uses LLMs to generate relevance judgments in both directions (ConvMix-Q and ConvMix-D), enabling scalable diversification of context-dependent queries and their relevant documents. It applies semantic-diversity clustering and Fisher Information-based near-distribution supervision to select high-quality, informative samples and to mix augmented data with original data when fine-tuning a dense retriever (ANCE). Evaluations on five benchmarks show that ConvMix-Combine achieves state-of-the-art results or strong improvements over baselines, including robust out-of-domain performance on the CAsT datasets. The work demonstrates a scalable way to enrich training data for conversational search and shows that LLMs can be used effectively for multi-aspect data generation when paired with quality- and distribution-aware selection.

Abstract

Conversational search aims to satisfy users' complex information needs via multi-turn interactions. The key challenge lies in revealing users' real search intent from context-dependent queries. Previous studies approach conversational search by fine-tuning a conversational dense retriever with relevance judgments between pairs of context-dependent queries and documents. However, this training paradigm suffers from data scarcity. To address this, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. We design a scalable two-sided relevance judgment augmentation schema with the aid of large language models. In addition, we integrate the framework with quality control mechanisms to obtain semantically diverse samples and with near-distribution supervision to combine various annotated data. Experimental results on five widely used benchmarks show that the conversational dense retriever trained with our ConvMix framework outperforms previous baseline methods, demonstrating its effectiveness.

Paper Structure

This paper contains 22 sections, 6 equations, 2 figures, 7 tables, 1 algorithm.

Figures (2)

  • Figure 1: Illustration of our ConvMix framework, including bidirectional augmentation, two quality control mechanisms with semantic and utilization selection, and mixing original and augmented data for conversational dense retrieval fine-tuning.
  • Figure 2: Model performance in terms of MRR with various ratios of augmented training samples on two datasets.