A Combination of BERT and Transformer for Vietnamese Spelling Correction
Hieu Ngo Trung, Duong Tran Ham, Tin Huynh, Kiem Hoang
TL;DR
The paper addresses Vietnamese spelling correction, a task complicated by diacritics and typographical errors, by integrating pre-trained BERT contextual embeddings with a Transformer-based Encoder-Decoder. It builds a large synthetic dataset from the Binhvq News Corpus and evaluates with BLEU, showing that Transformer variants augmented with BERT, especially those using PhoBERT-based inputs, perform best. The key contribution is a practical, end-to-end BERT-Transformer framework for Vietnamese spelling correction, including dataset construction and an experimental comparison in which the best model reaches a BLEU of 0.8624 (86.24 on a 0 to 100 scale), outperforming Google Docs spell checking and Word2Vec baselines. The work highlights the value of contextual embeddings in spelling correction and outlines future directions, such as exploring additional pre-trained models and larger datasets to improve robustness and the handling of proper nouns.
Abstract
Recently, many studies have shown the effectiveness of Bidirectional Encoder Representations from Transformers (BERT) in various Natural Language Processing (NLP) tasks. In particular, English spelling correction systems that combine an Encoder-Decoder architecture with BERT have achieved state-of-the-art results. However, to our knowledge, no such implementation exists for Vietnamese. Therefore, this study proposes a combination of the Transformer architecture (the state of the art for Encoder-Decoder models) and BERT for Vietnamese spelling correction. The experimental results show that our model outperforms other approaches as well as the Google Docs spell checking tool, achieving a BLEU score of 86.24 on this task.
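The summary above mentions a synthetic dataset built from clean news text and errors involving diacritics and typos. As a rough illustration only, the sketch below shows one way such noisy/clean sentence pairs could be generated. The function names (`strip_diacritics`, `swap_adjacent`, `corrupt`), the two error types, and the `error_rate` parameter are assumptions made for this example; they are not the authors' actual procedure, which is not described in this section.

```python
import random
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove Vietnamese tone and vowel marks, e.g. 'tiếng' -> 'tieng'."""
    # 'đ' and 'Đ' have no Unicode decomposition, so map them by hand.
    word = word.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Simulate a typing slip by swapping two neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def corrupt(sentence: str, error_rate: float = 0.3, seed: int = 0) -> str:
    """Return a noisy copy of `sentence`; the original serves as the target."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if rng.random() < error_rate:
            # Hypothetical choice of two error types with equal probability.
            if rng.choice(["diacritics", "swap"]) == "diacritics":
                word = strip_diacritics(word)
            else:
                word = swap_adjacent(word, rng)
        noisy.append(word)
    return " ".join(noisy)


if __name__ == "__main__":
    clean = "tiếng Việt có nhiều dấu thanh"
    print(corrupt(clean, error_rate=0.5, seed=1), "->", clean)
```

Each (noisy, clean) pair produced this way could then serve as a source/target example for an Encoder-Decoder model, with BLEU computed between the model output and the clean sentence.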
