The Pilot Corpus of the English Semantic Sketches
Maria Petrova, Maria Ponomareva, Alexandra Ivoylova
TL;DR
The paper develops the English SemSketches pilot corpus of semantically annotated verb sketches and provides parallel Russian sketches to enable cross-language analysis. It builds 100 English sketches from a large corpus, using Compreno semantic markup to produce interpretable verb portraits, and pairs them with corresponding Russian sketches for comparison, which supports the investigation of polysemy and word sense disambiguation (WSD). Key contributions include the bilingual sketch corpus (113 EN-RU pairs) and a detailed examination of cross-language differences, common sketching mistakes, and their linguistic implications. The work sets the stage for expanding the dataset, adding more semantic slots, and integrating with broader corpora such as GICR, with potential to advance lexical semantics and cross-lingual NLP tasks, particularly WSD.
Abstract
The paper is devoted to creating semantic sketches for English verbs. The pilot corpus consists of English-Russian sketch pairs and aims to show what kinds of contrastive studies the sketches make possible. Special attention is paid to cross-language differences between sketches with similar semantics. Moreover, we discuss the process of building a semantic sketch and analyse the mistakes that could give insight into the linguistic nature of sketches.
