The Pilot Corpus of the English Semantic Sketches

Maria Petrova, Maria Ponomareva, Alexandra Ivoylova

TL;DR

The paper develops the English SemSketches pilot corpus of semantically annotated verb sketches and provides parallel Russian sketches to enable cross-language analysis. It builds 100 English sketches from a large corpus using Compreno semantic markup to produce interpretable verb portraits, paired with corresponding Russian sketches for comparison, enabling investigation of polysemy and word sense disambiguation (WSD). Key contributions include the bilingual sketch corpus (113 EN-RU pairs) and a detailed examination of cross-language differences, common sketching mistakes, and their linguistic implications. The work sets the stage for expanding the dataset, adding more semantic slots, and integrating with broader corpora like GICR, with significant potential to advance lexical semantics and cross-lingual NLP tasks, particularly WSD.

Abstract

The paper is devoted to the creation of semantic sketches for English verbs. The pilot corpus consists of English-Russian sketch pairs and aims to show what kinds of contrastive studies the sketches make possible. Special attention is paid to cross-language differences between sketches with similar semantics. We also discuss the process of building a semantic sketch and analyse the mistakes that could give insight into the linguistic nature of sketches.

Paper Structure

This paper contains 11 sections, 14 figures.

Figures (14)

  • Figure 1: The sketch for the verb ‘to focus:TO_FOCUS’
  • Figure 2: The sketch for the verb ‘to explode:TO_BLOW_UP’
  • Figure 3: The sketch for the verb ‘взорвать:TO_BLOW_UP’
  • Figure 4: The sketch for the verb ‘to inflict:TO_BRING_STATE_TO_SMB’
  • Figure 5: The sketch for the verb ‘доставлять:TO_BRING_STATE_TO_SMB’
  • ...and 9 more figures