Computational Modelling of Plurality and Definiteness in Chinese Noun Phrases

Yuqi Liu, Guanyi Chen, Kees van Deemter

TL;DR

This work investigates how the plurality and definiteness of Chinese noun phrases can be inferred from context. It builds a large corpus from parallel data and evaluates a range of models, from classical ML to state-of-the-art PLMs. Using automatic cross-language annotation followed by human quality checks, it shows that explicit markers are infrequent, yet much of the information about plurality and definiteness is recoverable from context. PLMs, especially BERT-wwm and RoBERTa variants, substantially outperform traditional models, and joint 4-way prediction yields the best results, which suggests that plurality and definiteness are interdependent in pragmatic understanding. The findings support the notion of Chinese as a 'cool' language in which listeners rely on context, with implications for contextual NLP, cross-language pragmatics, and corpus-based annotation strategies.

Abstract

Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are "cooler" than other languages based on the observation that the intended meaning of phrases in these languages depends more on their contexts. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of the plurality and definiteness markers in Chinese noun phrases (NPs) to investigate the predictability of their intended meaning given the contexts. To this end, we built a corpus of Chinese NPs, each of which is accompanied by its corresponding context, and by labels indicating its singularity/plurality and definiteness/indefiniteness. We carried out corpus assessments and analyses. The results suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we train a bank of computational models using both classic machine learning models and state-of-the-art pre-trained language models to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.

Paper Structure (29 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Illustration of the PLM-based Models.
  • Figure 2: Weighted F1 for different context sizes. The size is measured by the number of sentences around the target sentence.
  • Figure 3: The confusion matrix for 4-way prediction of RoBERTa-large, in which S, P, I and D mean "singular", "plural", "indefinite" and "definite", respectively.
  • Figure 4: Macro F-scores of BERT-based models on implicit and explicit expressions of plurality and definiteness. The blue bars indicate the performance of models on implicit expressions while the orange bars indicate that on explicit expressions.
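
The captions above report both weighted F1 and macro F-scores for the 4-way setting. As a rough illustration of how the two averages differ, the following minimal sketch computes both on toy data. The joint label names (`SI`, `SD`, `PI`, `PD`, combining the S/P and I/D abbreviations from Figure 3) and the toy predictions are assumptions made for this example; they are not taken from the paper's code or data.

```python
from collections import Counter

# Hypothetical joint labels: singular/plural x indefinite/definite.
LABELS = ["SI", "SD", "PI", "PD"]


def per_class_f1(gold, pred, labels=LABELS):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(gold, pred, labels=LABELS):
    """Mean of per-class F1 weighted by each class's gold frequency."""
    scores = per_class_f1(gold, pred, labels)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in labels) / len(gold)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gold = ["SI", "SI", "SI", "SD", "PI", "PD"]
    pred = ["SI", "SI", "SD", "SD", "PI", "SI"]
    print("macro F1:   ", round(macro_f1(gold, pred), 4))
    print("weighted F1:", round(weighted_f1(gold, pred), 4))
```

On imbalanced label distributions the two numbers can diverge noticeably, because the macro average gives rare classes the same weight as frequent ones, while the weighted average is dominated by the frequent classes.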