Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning

Daniel Saggau, Mina Rezaei, Bernd Bischl, Ilias Chalkidis

TL;DR

This paper tackles the challenge of producing high-quality embeddings for long documents by combining a domain-adapted Longformer encoder with self-supervised contrastive learning (SimCSE) and augmenting the model with an ensemble of subnetworks trained via functional Bregman divergence. The proposed L_Total objective fuses the standard contrastive loss with a divergence-based regularizer to encourage diverse, non-redundant representations, yielding improved performance in linear evaluation on three long-document datasets from legal and biomedical domains. Empirical results show gains over baselines, with strong efficiency benefits (2–8x faster training) and notable improvements in few-shot settings, while reducing the risk of representation collapse. The approach suggests a practical path toward efficient, high-quality long-document embeddings suitable for IR and downstream classification tasks in real-world systems.

Abstract

Learning high-quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in transformer-based models that produce sentence embeddings with self-contrastive learning, encoding long documents (thousands of words) remains challenging in terms of both efficiency and quality. We therefore train Longformer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Furthermore, we complement the baseline method, a siamese neural network, with additional convex neural networks based on functional Bregman divergence, aiming to enhance the quality of the output document representations. We show that, overall, the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long-document topic classification tasks from the legal and biomedical domains.
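
To make the combined objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a SimCSE-style in-batch contrastive loss on cosine similarities, plus a second contrastive term whose similarities are negative Bregman divergences under a max-affine convex function phi(x) = max_k (W[k]·x + c[k]). The names `info_nce`, `bregman_divergence`, `total_loss`, the weight `lam`, and the temperature value are hypothetical choices made for illustration; the encoder that would produce the two views `z1` and `z2` is omitted.

```python
import numpy as np

def info_nce(sim, temperature=0.1):
    """Mean cross-entropy over rows, where row i's positive is column i."""
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def bregman_divergence(a, b, W, c):
    """Pairwise D_phi(a_i, b_j) for the assumed phi(x) = max_k (W[k] @ x + c[k]).

    For a max-affine phi, the gradient at b_j is the weight row of the
    affine piece that is active at b_j, so
    D(a_i, b_j) = phi(a_i) - (W[k_j] @ a_i + c[k_j]).
    """
    aff_a = a @ W.T + c             # (n, k): every affine piece evaluated at a_i
    aff_b = b @ W.T + c             # (m, k): every affine piece evaluated at b_j
    phi_a = aff_a.max(axis=1)       # (n,)
    k_b = aff_b.argmax(axis=1)      # (m,): index of the active piece at each b_j
    return phi_a[:, None] - aff_a[:, k_b]   # (n, m), non-negative by construction

def total_loss(z1, z2, W, c, lam=0.5, temperature=0.1):
    """Contrastive loss plus a weighted divergence-based term (lam is hypothetical)."""
    l_contrastive = info_nce(cosine_sim(z1, z2), temperature)
    # Smaller divergence means "more similar", so negate it to act as a similarity.
    l_bregman = info_nce(-bregman_divergence(z1, z2, W, c), temperature)
    return l_contrastive + lam * l_bregman

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z1 = rng.normal(size=(4, 8))                   # stand-in for view 1 embeddings
    z2 = z1 + 0.01 * rng.normal(size=z1.shape)     # stand-in for view 2 (dropout noise)
    W, c = rng.normal(size=(3, 8)), rng.normal(size=3)
    print(total_loss(z1, z2, W, c))
```

In this sketch the parameters `W` and `c` are fixed random values; in a trained model they would be learned jointly with the encoder.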

Paper Structure

This paper contains 17 sections, 7 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Illustration of our proposed self-contrastive method combining SimCSE (Gao et al., 2021; left part) with the additional Bregman divergence networks and objective of Rezaei et al. (2021; right part).