Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation

Haonan Chen, Zhicheng Dou, Kelong Mao, Jiongnan Liu, Ziliang Zhao

TL;DR

This paper addresses the data sparsity problem in conversational dense retrieval by introducing ConvAug, an LLM-driven data augmentation framework that combines cognition-aware prompting with a difficulty-adaptive sampling strategy. By generating multi-level augmented conversations that include both positives and hard negatives and training with a multi-task contrastive objective, ConvAug yields a more robust conversational context encoder. Experiments on four public datasets show strong improvements in both normal and zero-shot settings, illustrating the framework's generalization ability and applicability across base retrievers. The work provides a practical pathway to scalable and generalizable conversational search systems, with publicly available code.

Abstract

Conversational search utilizes multi-turn natural language contexts to retrieve relevant passages. Existing conversational dense retrieval models mostly view a conversation as a fixed sequence of questions and responses, overlooking the severe data sparsity problem -- that is, users can perform a conversation in various ways, and these alternate conversations are unrecorded. Consequently, they often struggle to generalize to diverse conversations in real-world scenarios. In this work, we propose a framework for generalizing Conversational dense retrieval via LLM-cognition data Augmentation (ConvAug). ConvAug first generates multi-level augmented conversations to capture the diverse nature of conversational contexts. Inspired by human cognition, we devise a cognition-aware process to mitigate the generation of false positives, false negatives, and hallucinations. Moreover, we develop a difficulty-adaptive sample filter that selects challenging samples for complex conversations, thereby giving the model a larger learning space. A contrastive learning objective is then employed to train a better conversational context encoder. Extensive experiments conducted on four public datasets, under both normal and zero-shot settings, demonstrate the effectiveness, generalizability, and applicability of ConvAug. The code is released at https://github.com/haon-chen/ConvAug.
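
The abstract mentions a contrastive learning objective over augmented positives and hard negatives but does not spell it out here. As a rough illustration only, the sketch below computes a generic InfoNCE-style loss for one conversation embedding. This is an assumption about the general family of objective, not the paper's exact formulation; the function name `info_nce_loss`, the dot-product similarity, and the `temperature` default are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor embedding (illustrative only).

    anchor:    embedding of the original conversation context
    positive:  embedding of an augmented conversation with the same intent
    negatives: embeddings of augmented conversations with a different intent
    temperature: hypothetical scaling factor, not taken from the paper
    """
    pos_logit = dot(anchor, positive) / temperature
    logits = [pos_logit] + [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    positive = [0.9, 0.1]
    easy_negative = [0.0, 1.0]   # far from the anchor
    hard_negative = [0.8, 0.2]   # close to the anchor
    print(info_nce_loss(anchor, positive, [easy_negative]))
    print(info_nce_loss(anchor, positive, [hard_negative]))
```

In this toy setup the hard negative yields a larger loss than the easy one, which is the usual motivation for including hard negatives in contrastive training: they give the encoder a stronger learning signal.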

Paper Structure

This paper contains 29 sections, 2 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The training workflow of our framework.
  • Figure 2: An example to illustrate our cognition-aware prompting process and multi-level augmented data.
  • Figure 3: The optimization of context encoder.
  • Figure 4: Turn-level performance comparisons on TopiOCQA (normal) and CAsT-21 (zero-shot).
  • Figure 5: An example to show the generated data of the LLM for a turn in QReCC.
  • ...and 1 more figure