SemSketches-2021: experimenting with the machine processing of the pilot semantic sketches corpus
Maria Ponomareva, Maria Petrova, Julia Detkova, Oleg Serikov, Maria Yarova
TL;DR
The paper introduces SemSketches-2021, a pilot Russian semantic-sketch corpus and a Shared Task to map anonymous sketches to predicate contexts, bridging human-interpretable semantic representations with automated processing. It details data construction from a Russian corpus, evaluation protocols, a weak baseline, and three LM-based systems (LM-score, Sentence-BERT similarity, and MLM-based predicate restoration), along with an analysis of their performance and error patterns. The work demonstrates the value of full semantic markup for SRL and WSD tasks, highlights challenges such as polysemy and homonym confusion, and proposes integration with the General Internet-Corpus of Russian (GICR) and multilingual expansion as future directions. Collectively, the SemSketches effort provides a foundation for linguistically informed probing and fine-tuning tasks for pretrained models and for advancing automatic semantic analysis in NLP applications.
Abstract
The paper explores different approaches to the machine processing of semantic sketches and presents a pilot open corpus of such sketches. It discusses various aspects of sketch creation, as well as the tasks that sketches can help to solve. Special attention is paid to building machine processing tools for the corpus. To this end, the SemSketches-2021 Shared Task was organized: participants were given anonymous sketches and a set of contexts containing the relevant predicates, and had to assign the proper contexts to the corresponding sketches.
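To make the task formulation concrete, the following minimal sketch shows one way the matching problem could be framed in code. It is an illustration only, not any participant's system or the organizers' baseline. It assumes that a sketch can be represented as a mapping from semantic role labels to typical filler words, a format this section does not specify. The sketch IDs, role labels, English toy sentences, and the word-overlap scoring are all hypothetical.

```python
from typing import Dict, List

# Assumed representation: role label -> typical filler words.
Sketch = Dict[str, List[str]]


def overlap_score(sketch: Sketch, context: str) -> int:
    """Count distinct sketch filler words that also occur in the context."""
    tokens = set(context.lower().split())
    fillers = {word.lower() for words in sketch.values() for word in words}
    return len(tokens & fillers)


def assign_contexts(sketches: Dict[str, Sketch], contexts: List[str]) -> Dict[str, str]:
    """Map each context to the ID of the sketch with the highest overlap.

    Ties are resolved in favor of the sketch that appears first.
    """
    return {
        ctx: max(sketches, key=lambda sid: overlap_score(sketches[sid], ctx))
        for ctx in contexts
    }


def accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Share of contexts whose predicted sketch ID matches the gold one."""
    correct = sum(predicted[ctx] == gold[ctx] for ctx in gold)
    return correct / len(gold)


# Hypothetical toy data, invented for illustration (not from the corpus).
sketches = {
    "sketch_1": {"Agent": ["chef", "mother"], "Object": ["soup", "dinner"]},
    "sketch_2": {"Agent": ["driver"], "Object": ["car", "truck"]},
}
contexts = [
    "the chef cooked soup for the guests",
    "the driver parked the truck outside",
]
gold = {contexts[0]: "sketch_1", contexts[1]: "sketch_2"}

predicted = assign_contexts(sketches, contexts)
print(predicted)
print(accuracy(predicted, gold))
```

Plain word overlap is used here only because it needs no external models. The LM-based systems mentioned in the TL;DR would replace `overlap_score` with a learned similarity or scoring function.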
