ViConBERT: Context-Gloss Aligned Vietnamese Word Embedding for Polysemous and Sense-Aware Representations

Khang T. Huynh, Dung H. Nguyen, Binh T. Nguyen

TL;DR

ViConBERT introduces a Vietnamese contextualized embedding model that aligns word-context representations with gloss definitions through gloss-context contrastive learning. By distilling a gloss space from a pretrained sentence-embedding model and training a context encoder with InfoNCE plus a Semantic Structure Loss, ViConBERT yields continuous, sense-aware representations for Vietnamese words. The authors also present ViConWSD, a large-scale synthetic benchmark for Vietnamese WSD and contextual similarity, generated from Vietnamese WordNet with glosses and context sentences produced by LLMs. Empirical results show state-of-the-art or competitive performance on WSD (F1 = 0.87), ViCon (AP = 0.88), and ViSim-400 (Spearman ρ = 0.60), validating the effectiveness of gloss-guided contextualization and synthetic evaluation in a low-resource setting. The work provides a practical framework and resources to advance fine-grained semantic understanding for Vietnamese, with potential applicability to other low-resource languages.
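As a rough illustration of the gloss-context alignment described above, the following is a minimal sketch of an InfoNCE objective over a batch of paired context and gloss embeddings. It is not the authors' implementation: the function names, the temperature value, and the toy vectors are assumptions chosen for illustration, and it assumes each gloss appears only once per batch and that both embeddings share the same dimensionality.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(context_embs, gloss_embs, temperature=0.1):
    """Mean InfoNCE loss where context_embs[i] is paired with gloss_embs[i].

    Every other gloss in the batch acts as a negative for context i.
    The temperature of 0.1 is an illustrative assumption, not a value from the paper.
    """
    ctx = [l2_normalize(v) for v in context_embs]
    gls = [l2_normalize(v) for v in gloss_embs]
    total = 0.0
    for i, c in enumerate(ctx):
        logits = [dot(c, g) / temperature for g in gls]
        # Numerically stable log-sum-exp over all glosses in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching gloss.
        total += log_denom - logits[i]
    return total / len(ctx)


# Toy batch: two senses with (frozen) gloss vectors, hypothetical values.
gloss = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # contexts close to their own gloss
swapped = [[0.1, 0.9], [0.9, 0.1]]   # contexts close to the wrong gloss

loss_aligned = info_nce(aligned, gloss)
loss_swapped = info_nce(swapped, gloss)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is lower when each context vector sits near its own gloss vector, which is the behavior the alignment objective is meant to encourage.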

Abstract

Recent advances in contextualized word embeddings have greatly improved semantic tasks such as Word Sense Disambiguation (WSD) and contextual similarity, but most progress has been limited to high-resource languages like English. Vietnamese, in contrast, still lacks robust models and evaluation resources for fine-grained semantic understanding. In this paper, we present ViConBERT, a novel framework for learning Vietnamese contextualized embeddings that integrates contrastive learning (SimCLR) and gloss-based distillation to better capture word meaning. We also introduce ViConWSD, the first large-scale synthetic dataset for evaluating semantic understanding in Vietnamese, covering both WSD and contextual similarity. Experimental results show that ViConBERT outperforms strong baselines on WSD (F1 = 0.87) and achieves competitive performance on ViCon (AP = 0.88) and ViSim-400 (Spearman's rho = 0.60), demonstrating its effectiveness in modeling both discrete senses and graded semantic relations. Our code, models, and data are available at https://github.com/tkhangg0910/ViConBERT

Paper Structure

This paper contains 25 sections, 7 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Training architecture of ViConBERT. Left (red): the Context Encoder processes sentences (e.g., "Anh ấy đang khoan tường.", "He is drilling the wall.") with the target word khoan ("drill") via multi-head attention and projection to produce contextual embeddings. Right (blue): the Gloss Encoder encodes glosses (e.g., "Hành động tạo ra một lỗ...", "An action that creates a hole...") into gloss embeddings. The objective combines InfoNCE, which aligns context and gloss embeddings, with a Semantic Structure Loss, which preserves their relative semantic structure. The context block is trainable; the gloss block is frozen.
  • Figure 2: Synthetic dataset construction pipeline.
  • Figure 3: Embedding space for different word types.
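
The Figure 1 caption says the Semantic Structure Loss preserves the relative semantic structure of context and gloss embeddings, but this section does not give its formula. The sketch below shows one plausible reading, stated here as an assumption rather than the paper's definition: penalize the mean squared difference between the pairwise cosine-similarity matrix of the context embeddings and that of the frozen gloss embeddings. All function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def similarity_matrix(embs):
    """Pairwise cosine similarities within one set of embeddings."""
    return [[cosine(a, b) for b in embs] for a in embs]


def structure_loss(context_embs, gloss_embs):
    """Assumed form of a structure-preserving loss (not taken from the paper).

    Compares how items relate to each other in the context space with how
    the corresponding items relate in the gloss space.
    """
    sc = similarity_matrix(context_embs)
    sg = similarity_matrix(gloss_embs)
    n = len(sc)
    return sum(
        (sc[i][j] - sg[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


# Hypothetical gloss vectors: senses 0 and 1 are related, sense 2 is distinct.
gloss = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
# Gloss vectors rotated by 90 degrees: pairwise similarities are unchanged.
rotated = [[0.0, 1.0], [-0.1, 0.9], [-1.0, 0.0]]
# All contexts collapsed to one direction: the relative structure is lost.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

loss_rotated = structure_loss(rotated, gloss)
loss_collapsed = structure_loss(collapsed, gloss)
print(loss_rotated < loss_collapsed)
```

Under this assumed formulation the loss depends only on relations within each space, so it would complement an alignment term such as InfoNCE rather than replace it: a rotated copy of the gloss space scores near zero here even though its vectors do not coincide with the glosses.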