SparseCL: Sparse Contrastive Learning for Contradiction Retrieval

Haike Xu; Zongyu Lin; Yizhou Sun; Kai-Wei Chang; Piotr Indyk

TL;DR

This paper introduces SparseCL, a sparsity-aware sentence-embedding approach for contradiction retrieval that combines cosine similarity with a sparsity-based difference measure (Hoyer sparsity). By training embeddings so that the differences between contradictory passages are sparse, SparseCL retrieves contradictions more accurately than similarity-based bi-encoders and at far lower cost than cross-encoders. It is validated on the Arguana dataset and on synthetic data derived from MSMARCO and HotpotQA. The method also demonstrates practical utility in corpus cleaning, recovering QA retrieval performance after contradictory content has been injected. Overall, SparseCL offers a scalable and effective way to detect contradictions in large text corpora, with potential for sublinear sparsity-based nearest-neighbor search in the future.

Abstract

Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications such as fact checking and data cleaning. For retrieving arguments that contradict a query from large document corpora, existing methods such as similarity search and cross-encoder models exhibit significant limitations. The former struggles to capture the essence of contradiction because it inherently favors similarity, while the latter suffers from computational inefficiency, especially when the corpus is large. To address these challenges, we introduce a novel approach, SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method uses a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically speeds up contradiction detection by reducing exhaustive document comparisons to simple vector calculations. We validate our model on the Arguana dataset, a benchmark specifically geared towards contradiction retrieval, as well as on synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval, with accuracy improvements of more than 30% on MSMARCO and HotpotQA across different model architectures, but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
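The combined metric described in the abstract can be illustrated with a small sketch. It assumes the score is cosine similarity plus a weighted Hoyer sparsity of the embedding difference. The weight `alpha`, the function names, the use of one embedding for both terms, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def hoyer_sparsity(x):
    """Hoyer sparsity per row: 0 for equal-magnitude entries, 1 for one-hot."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d = x.shape[-1]
    l1 = np.abs(x).sum(axis=-1)
    l2 = np.sqrt((x ** 2).sum(axis=-1))
    # Assumption: an all-zero row is treated as "not sparse" (score 0).
    ratio = np.divide(l1, l2, out=np.full_like(l1, np.sqrt(d)), where=l2 > 0)
    return (np.sqrt(d) - ratio) / (np.sqrt(d) - 1.0)


def cosine_similarity(query, docs, eps=1e-12):
    """Cosine similarity between one query vector and each row of docs."""
    query = np.asarray(query, dtype=float)
    docs = np.atleast_2d(np.asarray(docs, dtype=float))
    norms = np.linalg.norm(docs, axis=-1) * np.linalg.norm(query)
    return (docs @ query) / np.maximum(norms, eps)


def contradiction_scores(query, docs, alpha=1.0):
    """Hypothetical combined score: cosine + alpha * Hoyer(query - doc)."""
    docs = np.atleast_2d(np.asarray(docs, dtype=float))
    diff = np.asarray(query, dtype=float) - docs
    return cosine_similarity(query, docs) + alpha * hoyer_sparsity(diff)


# Toy 4-dimensional "embeddings", hand-made for illustration only.
query = np.array([1.0, 1.0, 1.0, 1.0])
docs = np.array([
    [0.9, 1.1, 0.9, 1.1],    # paraphrase-like: small, dense difference
    [1.0, 1.0, 1.0, 0.0],    # contradiction-like: difference in one coordinate
    [1.0, -1.0, 1.0, -1.0],  # unrelated: orthogonal to the query
])

scores = contradiction_scores(query, docs, alpha=1.0)
ranking = np.argsort(-scores)  # document indices, best candidate first
print(ranking)
```

In this toy setup, cosine similarity alone prefers the paraphrase-like document, while the sparsity term moves the document whose difference is concentrated in a single coordinate to the top. Every score is a handful of vector operations per document, which is the efficiency argument the abstract makes against cross-encoders.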

Paper Structure (40 sections, 6 equations, 3 figures, 9 tables)

Introduction
Related Work
Counter Argument Retrieval
Fact verification and LLM hallucination
Learning augmented LLM and retrieval corpus attack
Method
Problem Formulation
Embedding based method
Sparsity Enhanced Embeddings
SparseCL
Scoring function for contradiction retrieval
Experiments
Counter-argument Retrieval
Dataset
Training
...and 25 more sections

Figures (3)

Figure 1: Performance gains in NDCG@10 score across different sentence embedding models and datasets, showcasing the effectiveness and robustness of our SparseCL compared with standard contrastive learning (CL).
Figure 2: Comparison of our SparseCL with Cross-Encoder and Contrastive-Learning based Bi-Encoder for contradiction retrieval.
Figure 3: Histograms for the Hoyer sparsity of different pairs of sentence embedding differences on HotpotQA test set. The left figure is the histogram produced by a standard sentence embedding model ("bge-base-en-v1.5"), where the median Hoyer sparsity values for random pairs, paraphrases, and contradictions are $0.212,0.211,0.211$. The right figure is the histogram produced by our sentence embedding model fine-tuned from "bge-base-en-v1.5" using our SparseCL method, where the median Hoyer sparsity values for random pairs, paraphrases, and contradictions are $0.212,0.281,0.632$.
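
For reading the medians in Figure 3, it may help to recall the standard definition of Hoyer sparsity for a nonzero vector $x \in \mathbb{R}^d$. The formula is not given in this excerpt, so its exact normalization in the paper is an assumption here:

```latex
\mathrm{Hoyer}(x) \;=\; \frac{\sqrt{d} \;-\; \lVert x \rVert_1 / \lVert x \rVert_2}{\sqrt{d} - 1},
\qquad x \in \mathbb{R}^d,\; x \neq 0 .
```

Under this definition the value is 0 when all entries of $x$ have equal magnitude and 1 when exactly one entry is nonzero, so a higher median for contradiction pairs would mean that their embedding differences are concentrated in fewer coordinates.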
