SetCSE: Set Operations using Contrastive Learning of Sentence Embeddings
Kang Liu
TL;DR
SetCSE tackles the challenge of expressing and querying complex, multi-sentence semantics by representing meanings as sets of sentences and learning context-specific discriminative embeddings through an inter-set contrastive loss ${\mathcal{L}_{\text{inter-set}}}$. It defines concrete SetCSE operations, notably intersection and difference, and a ranking mechanism based on aggregated sentence-set similarities ${\text{SIM}}(x,S)$, enabling structured retrieval with non-commutative, order-sensitive behavior. Empirical results show substantial gains in intersection (~${39}\%$ accuracy, ~${37}\%$ F1) and difference (~${18}\%$ accuracy, ~${21}\%$ F1) across diverse datasets, while encoder-based models outperform decoder-based ones after fine-tuning. The framework supports complex semantic search, data annotation/active learning, and new topic discovery, demonstrated on ESG and Twitter data, with promising implications for real-world information retrieval tasks. Future work includes broader benchmarks, larger embedding models, and a public SetCSE API interface to facilitate adoption.
Abstract
Taking inspiration from Set Theory, we introduce SetCSE, an innovative information retrieval framework. SetCSE employs sets to represent complex semantics and incorporates well-defined operations for structured information querying under the provided context. Within this framework, we introduce an inter-set contrastive learning objective to enhance comprehension of sentence embedding models concerning the given semantics. Furthermore, we present a suite of operations, including SetCSE intersection, difference, and operation series, that leverage sentence embeddings of the enhanced model for complex sentence retrieval tasks. Throughout this paper, we demonstrate that SetCSE adheres to the conventions of human language expressions regarding compounded semantics, provides a significant enhancement in the discriminatory capability of underlying sentence embedding models, and enables numerous information retrieval tasks involving convoluted and intricate prompts which cannot be achieved using existing querying methods.
