Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi, Ciaran Cooney
TL;DR
This work tackles the domain shift inherent in patent text by pretraining domain-specific masked language models with the ModernBERT architecture on a curated corpus of over 60 million patent records. The models combine architectural optimizations (FlashAttention, rotary embeddings, GLU feed-forward layers) with a domain-tailored BPE tokenizer. On four downstream patent classification tasks, the domain-pretrained base model outperforms the general-purpose ModernBERT baseline on three and is competitive with PatentBERT, while all ModernBERT variants run inference more than 3x faster than PatentBERT. Scaling the model and customizing the tokenizer yield further gains on selected tasks. Overall, the study highlights the value of targeted pretraining and architecture-tokenizer synergy for patent NLP, particularly in time-sensitive applications.
Abstract
Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain three domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference, over 3x the speed of PatentBERT, underscoring their suitability for time-sensitive applications. These results highlight the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
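To make the masked language modeling objective mentioned in the abstract concrete, the following is a minimal sketch of how masked training examples might be constructed. It is not the authors' pipeline: the 30% masking rate, the `[MASK]` string, the whitespace "tokenizer", and the function name `mask_tokens` are all assumptions chosen for illustration. Real pipelines typically operate on subword token IDs and may replace some selected positions with random or unchanged tokens.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, for illustration only


def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Return (masked_tokens, labels) for masked language modeling.

    labels[i] holds the original token where position i was masked,
    and None where the token was left unchanged (no loss computed there).
    mask_prob is a hypothetical rate, not a value reported in the abstract.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must predict the original
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer.
claim = "a fastening device comprising a threaded shaft and a locking collar".split()
masked, labels = mask_tokens(claim, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During pretraining, the model sees the masked sequence and is trained to recover the original tokens at the masked positions only, which is how it picks up patent-specific vocabulary and phrasing from unlabeled text.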
