llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length

Issa Sugiura, Kouta Nakayama, Yusuke Oda

TL;DR

This work investigates long-context pretraining of encoder-only models by presenting llm-jp-modernbert, a ModernBERT model trained on a large Japanese corpus with a context length of $8192$ tokens. The model combines the ModernBERT architecture (RoPE, Local-Global Attention, FlashAttention) with a tailored tokenizer. It is pretrained in two stages, extending the sequence length from $1024$ to $8192$ tokens, on a Japanese dataset of ~0.69T tokens using masked language modeling (MLM) with a masking rate of $0.30$. Although the model does not outperform baselines on the JGLUE downstream tasks, it performs strongly on fill-mask evaluation, and pseudo-perplexity experiments show how context length expansion affects its predictions. An analysis of sentence embeddings reveals an alignment–uniformity trade-off during training, with the final checkpoints resembling other models that share the same architecture. The authors release the model together with the training and evaluation code to support reproducibility and further work on long-context encoder models for Japanese NLP.

Abstract

Encoder-only transformer models like BERT are widely adopted as a pre-trained backbone for tasks like sentence classification and retrieval. However, pretraining of encoder models with large-scale corpora and long contexts has been relatively underexplored compared to decoder-only transformers. In this work, we present llm-jp-modernbert, a ModernBERT model trained on a publicly available, massive Japanese corpus with a context length of 8192 tokens. While our model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask test evaluations. We also analyze the effect of context length expansion through pseudo-perplexity experiments. Furthermore, we investigate sentence embeddings in detail, analyzing their transitions during training and comparing them with those from other existing models, confirming similar trends with models sharing the same architecture. To support reproducibility and foster the development of long-context BERT, we release our model, along with the training and evaluation code.
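
The pseudo-perplexity mentioned above is commonly computed by masking each token in turn, scoring the original token with the masked language model, and exponentiating the negative mean log-probability. The following is a minimal sketch of that usual definition, not the authors' evaluation code; `masked_log_prob` and `uniform_scorer` are hypothetical stand-ins for a real model call, and the exact procedure used in the paper is not specified in this section.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer type: returns log P(tokens[i] | all other tokens)
# when position i is replaced by the mask token.
Scorer = Callable[[Sequence[int], int], float]


def pseudo_perplexity(tokens: Sequence[int], masked_log_prob: Scorer) -> float:
    """exp of the negative mean masked log-probability over all positions."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    total = sum(masked_log_prob(tokens, i) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))


def uniform_scorer(vocab_size: int) -> Scorer:
    """Toy scorer (assumption): every token is equally likely."""
    def score(tokens: Sequence[int], i: int) -> float:
        return -math.log(vocab_size)
    return score


if __name__ == "__main__":
    # A model that guesses uniformly over 50 tokens has pseudo-perplexity 50.
    print(pseudo_perplexity([3, 1, 4, 1, 5], uniform_scorer(50)))
```

With a real encoder, the scorer would run one forward pass per masked position, so the cost grows linearly with sequence length, which matters when evaluating inputs of up to 8192 tokens.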

Paper Structure

This paper contains 21 sections, 2 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 3: Alignment and uniformity. s1 and s2 represent Stage 1 and Stage 2, respectively.
  • Figure 5: The sequence length distribution of sentences with various sequence lengths ranging from 0 to 8192, prepared for the pseudo-perplexity experiment. The sequence length in this figure refers to the token count obtained using the llm-jp-tokenizer v3.
  • Figure : Loss on validation data
  • Figure : 500k steps in Stage 1
  • Figure : 0k steps in Stage 1
  • ...and 16 more figures
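
The alignment and uniformity quantities plotted in Figure 3 are usually defined on L2-normalized embeddings: alignment is the mean squared distance between embeddings of positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs of embeddings. The sketch below follows those standard definitions in plain Python; the function names and the choice of `t=2.0` are assumptions for illustration, and the paper's exact setup (pair construction, pooling) is not given in this section.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Mean squared distance between normalized positive pairs (lower is better)."""
    dists = [sq_dist(normalize(u), normalize(v)) for u, v in pairs]
    return sum(dists) / len(dists)


def uniformity(embeddings: Sequence[Vector], t: float = 2.0) -> float:
    """Log mean of exp(-t * squared distance) over all distinct pairs (lower is better)."""
    unit = [normalize(v) for v in embeddings]
    vals = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(unit, 2)]
    return math.log(sum(vals) / len(vals))


if __name__ == "__main__":
    print(alignment([([1.0, 0.0], [2.0, 0.0])]))
    print(uniformity([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
```

Collapsing all embeddings to a single point makes alignment perfect but uniformity as bad as possible (0), which is one way to see why the two metrics can trade off during training.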