ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Wissam Antoun, Benoît Sagot, Djamé Seddah

TL;DR

The paper investigates whether the improvements claimed for ModernBERT arise from its architecture or from its training data by retraining ModernBERT on shared French datasets and comparing it against DeBERTaV3 and RoBERTa baselines. Using two pretraining corpora, one of them a high-quality filtered dataset, the study shows that DeBERTaV3 generally delivers better sample efficiency and task performance under identical data, except on text retrieval, where ModernBERT excels. ModernBERT trains and runs inference substantially faster, an efficiency gain that is separate from accuracy. The findings stress the importance of fair, data-controlled comparisons and suggest that current NLP benchmarks may be approaching saturation. The models are publicly released to support reproducibility.

Abstract

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being its support for long context, faster training, and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings show the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.

Paper Structure

This paper contains 25 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Downstream performance on QA throughout the pre-training stage. Points labeled "wsd" are models evaluated before the cooldown period.
  • Figure 2: Downstream performance on NER throughout the pre-training stage. Points labeled "wsd" are models evaluated before the cooldown period.
  • Figure 3: Instances of divergence during QA fine-tuning. Colored lines illustrate the maximum score at a given step.