Dewey Long Context Embedding Model: A Technical Report

Dun Zhang, Panxiang Zou, Yudong Zhou

TL;DR

This work addresses the challenge of embedding very long documents by introducing chunk-alignment training, which distills a teacher model into a student so that the student learns both whole-text and chunk-level representations. The dewey_en_beta model extends the context length to 128K tokens. It builds on ModernBERT-Large with RoPE-based long-context enhancements and offers two encoding modes (CLS and chunk/mean embeddings). Training uses unsupervised data, two chunking strategies, and a dual-loss objective that aligns student and teacher embeddings, and the model achieves competitive results on the MTEB (Eng, v2) and LongEmbed benchmarks. The release aims to advance retrieval-augmented generation and long-document retrieval by providing an open-source, scalable path toward more coherent long-context representations.

Abstract

This technical report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model. The growing demand for retrieval-augmented generation (RAG) systems and the expanding context windows of large language models (LLMs) pose critical challenges for conventional embedding models. Current approaches often struggle to maintain semantic coherence on documents that exceed typical sequence length limits, which significantly degrades retrieval performance in knowledge-intensive applications. dewey_en_beta is a text embedding model that achieves strong performance on the MTEB (Eng, v2) and LongEmbed benchmarks while supporting sequences of 128K tokens. Our technical contribution centers on chunk alignment training, a distillation-based methodology that lets the model produce localized chunk embeddings and a global document-level representation at the same time. Information regarding the model release can be found at https://huggingface.co/infgrad/dewey_en_beta.
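The idea of aligning a student's whole-text and chunk-level embeddings with a teacher's can be illustrated with a small sketch. This is not the report's implementation: the function names, the use of cosine distance, the equal weighting of the two terms, and the mean pooling over chunk spans are all assumptions made for illustration, and plain Python lists stand in for model outputs.

```python
import math


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def chunk_embeddings(token_states, chunk_spans):
    """Mean-pool student token states over each (start, end) span.

    Assumption: a chunk embedding is the mean of its token states.
    """
    return [mean_pool(token_states[start:end]) for start, end in chunk_spans]


def alignment_loss(cls_state, token_states, chunk_spans,
                   teacher_whole, teacher_chunks):
    """Hypothetical dual loss: whole-text term plus chunk-level term.

    Both terms use cosine distance (1 - cosine similarity) and are
    weighted equally; the actual objective may differ.
    """
    whole_term = 1.0 - cosine(cls_state, teacher_whole)
    student_chunks = chunk_embeddings(token_states, chunk_spans)
    chunk_term = sum(
        1.0 - cosine(s, t) for s, t in zip(student_chunks, teacher_chunks)
    ) / len(chunk_spans)
    return whole_term + chunk_term


# Toy inputs: four token states split into two chunks of two tokens each.
token_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
chunk_spans = [(0, 2), (2, 4)]
teacher_chunks = [[2.0, 0.0], [0.0, 3.0]]  # same directions as the student chunks
aligned_loss = alignment_loss([1.0, 1.0], token_states, chunk_spans,
                              [2.0, 2.0], teacher_chunks)
misaligned_loss = alignment_loss([1.0, 1.0], token_states, chunk_spans,
                                 [1.0, -1.0], teacher_chunks)
```

In this toy setup the loss is close to zero when the student's CLS and chunk embeddings point in the same directions as the teacher's, and it grows when the whole-text embedding diverges, which is the behavior a distillation objective of this kind is meant to encourage.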

Paper Structure

This paper contains 4 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Model architecture