Efficient Document Embeddings via Self-Contrastive Bregman Divergence Learning
Daniel Saggau, Mina Rezaei, Bernd Bischl, Ilias Chalkidis
TL;DR
This paper tackles the challenge of producing high-quality embeddings for long documents by combining a domain-adapted Longformer encoder with self-supervised contrastive learning (SimCSE) and an ensemble of subnetworks trained via functional Bregman divergence. The proposed L_Total objective adds a divergence-based regularizer to the standard contrastive loss to encourage diverse, non-redundant representations, which improves linear-evaluation performance on three long-document datasets from the legal and biomedical domains. Empirical results show gains over baselines, training that is 2–8x faster, and notable improvements in few-shot settings, along with a reduced risk of representation collapse. The approach suggests a practical path toward efficient, high-quality long-document embeddings for IR and downstream classification in real-world systems.
Abstract
Learning quality document embeddings is a fundamental problem in natural language processing (NLP), information retrieval (IR), recommendation systems, and search engines. Despite recent advances in transformer-based models that produce sentence embeddings with self-contrastive learning, encoding long documents (thousands of words) remains challenging with respect to both efficiency and quality. Therefore, we train Longformer-based document encoders using a state-of-the-art unsupervised contrastive learning method (SimCSE). Furthermore, we complement the baseline method, a siamese neural network, with additional convex neural networks based on functional Bregman divergence, aiming to enhance the quality of the output document representations. We show that, overall, the combination of a self-contrastive siamese network and our proposed neural Bregman network outperforms the baselines in two linear classification settings on three long-document topic classification tasks from the legal and biomedical domains.
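As a rough illustration of the self-contrastive component mentioned above, the following is a minimal sketch of a SimCSE-style in-batch contrastive (InfoNCE) loss. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the temperature of 0.05 are assumptions made for illustration, and the Bregman divergence term is not sketched here. Each document is assumed to be encoded twice (for example, under different dropout masks), and the two views of the same document are treated as the positive pair, with all other documents in the batch serving as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive loss (illustrative helper, not from the paper).

    view_a[i] and view_b[i] are two embeddings of document i; every
    view_b[j] with j != i acts as a negative for view_a[i].
    The temperature value is an assumed default.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive pair.
        total += log_denominator - logits[i]
    return total / n


# Toy batch of two "documents" with hand-picked embeddings.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives match their anchors
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives point away from anchors

loss_aligned = info_nce_loss(view_a, aligned)
loss_swapped = info_nce_loss(view_a, swapped)
```

In this toy batch the loss is close to zero when each positive matches its anchor and large when the positives are swapped, which is the behavior a contrastive objective is meant to have.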
