Resolving Regular Polysemy in Named Entities
Shu-Kai Hsieh, Yu-Hsiang Tseng, Hsin-Yu Chou, Ching-Wen Yang, Yu-Yun Chang
TL;DR
This work addresses regular polysemy in proper names by modeling the multiple senses of a named entity as dot objects. It proposes a unified, gloss-based word sense disambiguation (WSD) framework (GlossBERT) that disambiguates both common-word senses listed in Chinese Wordnet (CWN) and dot-object usages of proper names in Mandarin Chinese, drawing on CWN glosses, example sentences, and Wikidata mappings. The authors construct two annotated datasets: a CWN-based WSD dataset covering 113 difficult words, and a proper-noun dataset annotated with seven dot-object types. In a POS-guided setting, the model scores 0.86 on WSD and 0.88 on regular polysemy (RP). Beyond accuracy, the approach supports lexical-resource development by linking contextual usage to sense inventories and dot-object categories. This makes open-class proper nouns tractable at scale and opens the way to extending regular-polysemy analysis to common words. The results highlight the value of gloss-based, context-aware disambiguation for enriching lexical resources and improving cross-resource alignment in Mandarin NLP.
Abstract
Word sense disambiguation (WSD) primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Proper names, by contrast, are usually considered to denote an ad hoc real-world referent: once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguity through appellativization, i.e., they act like common words and may denote different aspects of their referents. We propose to address the ambiguities of proper names in light of regular polysemy, which we formalize as dot objects. This paper introduces a combined WSD model that disambiguates common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based architecture, which takes advantage of the glosses and example sentences in CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
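To make the gloss-based formulation concrete, the following is a minimal sketch of how one target word could be paired with candidate glosses, so that common-word senses and dot-object types are handled by the same procedure. It assumes a GlossBERT-style setup in which each (context, gloss) pair is scored independently and the highest-scoring candidate is chosen. The names `Candidate`, `build_pairs`, and `choose_label`, the labels, the glosses, and the scores are all hypothetical illustrations, not the authors' code, data, or sense inventory; an English sentence stands in for Mandarin input.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Candidate:
    """One candidate reading: a sense ID or a dot-object type (hypothetical)."""
    label: str
    gloss: str


def build_pairs(
    context: str, target: str, candidates: Sequence[Candidate]
) -> List[Tuple[str, str, str]]:
    """Return one (label, marked_context, gloss_segment) triple per candidate.

    The first occurrence of the target is wrapped in quotation marks and the
    gloss is prefixed with the target, so a pair classifier can see which
    word is being disambiguated.
    """
    if target not in context:
        raise ValueError("target not found in context")
    marked = context.replace(target, f'"{target}"', 1)
    return [(c.label, marked, f"{target}: {c.gloss}") for c in candidates]


def choose_label(
    pairs: Sequence[Tuple[str, str, str]], scores: Sequence[float]
) -> str:
    """Pick the label whose (context, gloss) pair received the highest score."""
    if len(pairs) != len(scores) or not pairs:
        raise ValueError("need exactly one score per pair")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return pairs[best][0]


# Hypothetical dot-object candidates for a proper name (invented glosses).
candidates = [
    Candidate("organization", "a company regarded as an institution"),
    Candidate("product", "an item manufactured by the company"),
]
pairs = build_pairs("She drives a Toyota to work.", "Toyota", candidates)

# Placeholder scores standing in for the output of a trained pair classifier.
predicted = choose_label(pairs, [0.2, 0.9])
```

Because a new reading is introduced simply by supplying another gloss, the candidate set in such a setup is not tied to a fixed output layer, which is one plausible reading of the flexibility the abstract attributes to the gloss-based architecture.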
