An Encoder-Integrated PhoBERT with Graph Attention for Vietnamese Token-Level Classification

Ba-Quang Nguyen

TL;DR

This work addresses Vietnamese token-level classification under data-scarce conditions by introducing TextGraphFuseGAT, a hybrid architecture that fuses a monolingual encoder (PhoBERT) with a fully connected token graph via Graph Attention Networks and refines the resulting representations with a Transformer decoder. All components are fine-tuned jointly: PhoBERT produces token representations $H \in \mathbb{R}^{n \times d}$ (one $d$-dimensional row per token), which are augmented by the GAT output $H^{\text{gat}}$ and then refined to $H^{\text{dec}}$ before classification. The model achieves state-of-the-art results across three benchmarks (PhoNER-COVID19, PhoDisfluency, VietMed-NER), with Micro-F1 scores near ceiling on the simpler datasets and notable Macro-F1 gains on VietMed-NER, demonstrating that explicit relational modeling complements sequential encoding in multilingual, low-resource contexts. The findings highlight the practical impact of combining graph-based relational biases with transformer-based decoding for robust token-level labeling in Vietnamese, and point to future extensions in multilingual transfer, domain adaptation, and explainability.

Abstract

We propose a novel neural architecture named TextGraphFuseGAT, which integrates a pretrained transformer encoder (PhoBERT) with Graph Attention Networks for token-level classification tasks. The proposed model constructs a fully connected graph over the token embeddings produced by PhoBERT, enabling the GAT layer to capture rich inter-token dependencies beyond those modeled by sequential context alone. To further enhance contextualization, a Transformer-style self-attention layer is applied on top of the graph-enhanced embeddings. The final token representations are passed through a classification head to perform sequence labeling. We evaluate our approach on three Vietnamese benchmark datasets: PhoNER-COVID19 for named entity recognition in the COVID-19 domain, PhoDisfluency for speech disfluency detection, and VietMed-NER for medical-domain NER. VietMed-NER is the first Vietnamese medical spoken NER dataset, featuring 18 entity types collected from real-world medical speech transcripts and annotated with the BIO tagging scheme. Its specialized vocabulary and domain-specific expressions make it a challenging benchmark for token-level classification models. Experimental results show that our method consistently outperforms strong baselines, including transformer-only and hybrid neural models such as BiLSTM + CNN + CRF, confirming the effectiveness of combining pretrained semantic features with graph-based relational modeling for improved token classification across multiple domains.
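The graph step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes a single-head GAT layer in the standard formulation, where the score for tokens $i$ and $j$ is a LeakyReLU over a learned projection of the pair, followed by a softmax over all tokens $j$ because the graph is fully connected. Random vectors stand in for PhoBERT outputs, the fusion of $H$ and $H^{\text{gat}}$ is assumed to be a residual sum, the Transformer refinement layer is omitted for brevity, and all names (`gat_fully_connected`, `W_cls`, the toy sizes) are hypothetical.

```python
import math
import random


def matvec_rows(H, W):
    """Multiply each row of H (n x d_in) by W (d_in x d_out)."""
    return [[sum(h[k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
            for h in H]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_fully_connected(H, W, a_src, a_dst):
    """Single-head GAT over a fully connected token graph (self-loops included).

    e_ij  = LeakyReLU(a_src . W h_i + a_dst . W h_j)
    alpha = softmax over j of e_ij
    out_i = sum_j alpha_ij * W h_j
    """
    Z = matvec_rows(H, W)
    src = [sum(a * z for a, z in zip(a_src, zi)) for zi in Z]
    dst = [sum(a * z for a, z in zip(a_dst, zj)) for zj in Z]
    n = len(Z)
    alpha = [softmax([leaky_relu(src[i] + dst[j]) for j in range(n)])
             for i in range(n)]
    H_gat = [[sum(alpha[i][j] * Z[j][k] for j in range(n))
              for k in range(len(Z[0]))] for i in range(n)]
    return H_gat, alpha


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 4 tokens, hidden size 6, 3 labels.
rng = random.Random(0)
n, d, num_labels = 4, 6, 3

H = rand_matrix(n, d, rng)            # stand-in for PhoBERT token embeddings
W = rand_matrix(d, d, rng)            # GAT projection
a_src = [rng.gauss(0.0, 0.1) for _ in range(d)]
a_dst = [rng.gauss(0.0, 0.1) for _ in range(d)]
W_cls = rand_matrix(d, num_labels, rng)  # classification head

H_gat, alpha = gat_fully_connected(H, W, a_src, a_dst)

# Assumed fusion: residual sum of encoder and graph representations.
H_fused = [[h + g for h, g in zip(hi, gi)] for hi, gi in zip(H, H_gat)]

logits = matvec_rows(H_fused, W_cls)
pred = [max(range(num_labels), key=lambda c: row[c]) for row in logits]
print(pred)
```

Because the graph is fully connected, each row of `alpha` is a distribution over all $n$ tokens, so the layer costs $O(n^2)$ per sentence, the same order as self-attention. In a real setup the weights would be trained jointly with the encoder rather than sampled at random.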

Paper Structure

This paper contains 26 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Overall architecture of the proposed model.