BERT-based model for Vietnamese Fact Verification Dataset

Bao Tran, T. N. Khanh, Khang Nguyen Tuong, Thien Dang, Quang Nguyen, Nguyen T. Thinh, Vo T. Hung

TL;DR

This work tackles automated fact verification in Vietnamese with a unified model that jointly performs sentence selection and claim verification, built on PhoBERT and XLM-RoBERTa backbones. It introduces the ISE-DSC01 dataset, a FEVER-like Vietnamese resource with ~49,675 items labeled SUPPORTED, REFUTED, or NEI, and demonstrates that a 1-phase classification approach within a single network outperforms both a 2-phase variant and a BERT-FEVER baseline. Key contributions include a transformer-based encoder for claim–sentence pairs, a rationale-selection module that picks the top supporting sentence, and a label-classification module that predicts the verdict. All components are evaluated on Vietnamese data, where the abstract reports a Strict Accuracy of 75.11%, a 28.83% improvement over the baseline. The results show substantial gains on the private test set, highlighting the practical potential for Vietnamese fact verification, while acknowledging dataset size as a limitation and pointing to future work on scaling and efficiency.
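The flow described above (encode each claim–sentence pair, select the top sentence, predict a verdict) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `verify_claim`, `PairScorer`, and `toy_scorer` are hypothetical names, the word-overlap scorer merely stands in for a trained PhoBERT or XLM-RoBERTa encoder, and returning an empty evidence string for NEI is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

LABELS = ("SUPPORTED", "REFUTED", "NEI")

# Hypothetical scorer type: maps (claim, sentence) to a rationale score and
# one score per label. A fine-tuned transformer with two heads would play
# this role in practice; here it can be any callable.
PairScorer = Callable[[str, str], Tuple[float, Sequence[float]]]


def verify_claim(
    claim: str, sentences: List[str], scorer: PairScorer
) -> Tuple[str, str]:
    """Return (verdict, evidence) for a claim against candidate sentences."""
    if not sentences:
        return "NEI", ""
    scored = [scorer(claim, s) for s in sentences]
    # Rationale selection: keep the single highest-scoring sentence.
    best = max(range(len(sentences)), key=lambda i: scored[i][0])
    # Label classification: argmax over the label scores of that pair.
    label_scores = scored[best][1]
    verdict = LABELS[max(range(len(LABELS)), key=lambda k: label_scores[k])]
    # Assumption: NEI predictions carry no evidence sentence.
    evidence = "" if verdict == "NEI" else sentences[best]
    return verdict, evidence


def toy_scorer(claim: str, sentence: str) -> Tuple[float, Sequence[float]]:
    """Toy stand-in based on word overlap; not a trained model."""
    c, s = set(claim.lower().split()), set(sentence.lower().split())
    overlap = len(c & s) / max(len(c), 1)
    return overlap, (overlap, 0.0, 1.0 - overlap)
```

Because selection and classification share one scoring pass per pair, this mirrors the single-network idea only loosely; the actual heads, losses, and tie-breaking rules are not specified in this summary.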

Abstract

The rapid advancement of information and communication technology has made information easier to access. However, this progress has also created a need for more stringent verification to ensure that information is accurate, particularly in the context of Vietnam. This paper introduces an approach to Fact Verification on a Vietnamese dataset that integrates sentence selection and classification modules into a unified network architecture. The proposed approach uses pre-trained PhoBERT and XLM-RoBERTa language models as the backbone of the network. The proposed model was trained on a Vietnamese dataset named ISE-DSC01 and outperformed the baseline model across all three metrics. Notably, we achieved a Strict Accuracy of 75.11%, a 28.83% improvement over the baseline model.
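The abstract does not define Strict Accuracy. A common reading for FEVER-style tasks, assumed here, is that a prediction counts as correct only when both the verdict and the selected evidence sentence match the reference. Under that assumption, the metric can be sketched as below; `strict_accuracy` and the `(verdict, evidence)` tuple layout are hypothetical.

```python
from typing import List, Tuple

Prediction = Tuple[str, str]  # (verdict, evidence sentence); "" when no evidence


def strict_accuracy(
    predictions: List[Prediction], references: List[Prediction]
) -> float:
    """Fraction of items where verdict AND evidence both match the reference.

    Assumption: this is the intended definition; the source does not state it.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(
        1
        for (p_label, p_ev), (r_label, r_ev) in zip(predictions, references)
        if p_label == r_label and p_ev.strip() == r_ev.strip()
    )
    return hits / len(references)
```

A model that predicts the right verdict but cites the wrong sentence earns no credit under this definition, which is why it is stricter than plain label accuracy.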

Paper Structure

This paper contains 12 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Data pipeline
  • Figure 2: Pipeline for our approach
  • Figure 3: Pipeline for 3 phase