Evaluating Distributed Representations for Multi-Level Lexical Semantics: A Research Proposal
Zhu Liu
TL;DR
This paper examines how distributed representations from PLMs and LLMs encode lexical semantics, evaluated at three levels (local, global, and mixed) using cross-lingual benchmarks. It introduces a formal four-space model $(\mathcal{W}, \mathcal{M}, \mathcal{R}, \mathcal{C})$ and level-specific likelihoods, e.g., $p(e)=p(e \mid w,s)$, $p(\mathcal{M})=p([e_i]_N \mid \mathcal{W})$, and $p(\mathcal{C})=p([c_i]_M \mid \mathcal{W}, \mathcal{R})$, to structure the evaluations. It develops analyses of local sense continuity, uncertainty in WSD, semantic roles through minimal-language pairs, global word networks, and mixed conceptual spaces via Semantic Map Models, leveraging both PLMs and LLMs. By exposing challenges in extraction, probe design, dataset bias, and scaling-related interpretability, the work aims to advance transparent, cross-linguistic lexical semantics and to close gaps between computational models and linguistic theory.
Abstract
Modern neural networks (NNs), trained on extensive raw sentence data, construct distributed representations by compressing individual words into dense, continuous, high-dimensional vectors. These representations are expected to capture lexical meaning at multiple levels. In this thesis, we examine how effectively distributed representations from NNs encode lexical meaning. First, we identify and formalize three levels of lexical semantics: \textit{local}, \textit{global}, and \textit{mixed}. Then, for each level, we evaluate language models by collecting or constructing multilingual datasets, applying a range of language models, and drawing on theories of linguistic analysis. This thesis builds a bridge between computational models and lexical semantics so that each can inform the other.
