SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus

Maria Ponomareva, Maria Petrova, Julia Detkova, Oleg Serikov, Maria Yarova

TL;DR

The paper introduces SemSketches-2021, a pilot Russian semantic-sketch corpus and a Shared Task to map anonymous sketches to predicate contexts, bridging human-interpretable semantic representations with automated processing. It details data construction from a Russian corpus, evaluation protocols, a weak baseline, and three LM-based systems (LM-score, Sentence-BERT similarity, and MLM-based predicate restoration), along with an analysis of their performance and error patterns. The work demonstrates the value of full semantic markup for SRL and WSD tasks, highlights challenges such as polysemy and homonym confusion, and proposes integration with the General Internet-Corpus of Russian (GICR) and multilingual expansion as future directions. Collectively, the SemSketches effort provides a foundation for linguistically informed probing and fine-tuning tasks for pretrained models and for advancing automatic semantic analysis in NLP applications.

Abstract

The paper explores approaches to the machine processing of semantic sketches and presents a pilot open corpus of such sketches. It discusses how the sketches are created and which tasks they can help to solve. Special attention is paid to building machine-processing tools for the corpus; for this purpose, the SemSketches-2021 Shared Task was organized. Participants were given anonymous sketches and a set of contexts containing the relevant predicates, and had to assign each context to its corresponding sketch.

Paper Structure

This paper contains 16 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: the sketch for the verb <<страдать:SUFFERING_AND_TORMENT>> (‘to suffer’). Here the elements of the sketch are given with their rough translations.
  • Figure 2: the sketch for the verb <<готовить:TO_PREPARE_FOOD_SUBSTANCE>> (‘to prepare food, to cook’). Here the elements of the sketch are given with their rough translations.
  • Figure 3: the semantic sketch for the verb <<выходить:идти:TO_WALK>> (‘to go out’). Here the elements of the sketch are given with their rough translations.