MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets

Joanne Boisson, Arif Mehmood, Jose Camacho-Collados

TL;DR

MetaphorShare addresses fragmentation in metaphor research by offering an open repository that brings diverse metaphor datasets together under a minimal CSV-based format with a shared tagging scheme. The platform provides four core functions (upload, download, search, and online labeling), supported by Elasticsearch-backed search and a validation workflow that checks submitted files for data quality and interoperability. A cross-dataset evaluation with RoBERTa illustrates its practical value: models are fine-tuned on individual datasets and tested on the others to measure how well they generalize across resources. The work aims to foster interdisciplinary collaboration, expand multilingual coverage, and enable automatic or semi-automatic labeling of metaphors to accelerate research on metaphor processing in NLP.

Abstract

The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among researchers. Both in the human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an identical format. To facilitate this, we present MetaphorShare, a website that integrates metaphor datasets and makes them open and accessible. With this effort, our aim is to encourage researchers to share and upload more datasets in any language in order to facilitate metaphor studies and the development of future metaphor processing NLP systems. The website has four main functionalities: uploading, downloading, searching and labelling metaphor datasets. It is accessible at www.metaphorshare.com.

Paper Structure

This paper contains 33 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: MetaphorShare search page. Specific datasets, languages, and tag types can be selected, and a text-based search is available either within tagged expressions or over the entire text. Additional features provided with the record appear when clicking the Show Details button.
  • Figure 2: Screenshot of the online annotation tool showing the text input area, tag selection and creation, and resulting tagged text highlighted in different colours.
  • Figure 3: Results of the cross-dataset evaluation: F1-score of the metaphor class. Each training set contains 800 examples, and the test set sizes are shown on the $x$-axis.
  • Figure 4: Screenshot of the top of the dataset information page in the catalog section of the website. The English dataset released with Jankowiak (2020) is presented as an example.
  • Figure 5: Screenshot of the file format check for a rejected file. The check reports the line of the CSV file on which each error occurs and the type of error.
  • ...and 2 more figures
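
The file format check described in the Figure 5 caption (reporting the line of the CSV file where an error occurs, together with the error type) can be illustrated with a minimal Python sketch. This is not MetaphorShare's actual validator: the required column names `text` and `tag`, the function name `check_csv`, and the error messages are all assumptions made for illustration, since this section does not specify the real format.

```python
import csv
import io

# Hypothetical required columns; the real MetaphorShare format may differ.
REQUIRED_COLUMNS = ("text", "tag")


def check_csv(content, required=REQUIRED_COLUMNS):
    """Return a list of (line_number, error_type) pairs; empty if the file passes."""
    reader = csv.reader(io.StringIO(content))
    header = next(reader, None)
    if header is None:
        return [(1, "empty file")]

    # A header problem makes row-level checks meaningless, so stop early.
    missing = [col for col in required if col not in header]
    if missing:
        return [(1, "missing column(s): " + ", ".join(missing))]

    errors = []
    for row in reader:
        line_no = reader.line_num  # line of the source file just consumed
        if len(row) != len(header):
            errors.append((line_no, "wrong number of fields"))
            continue
        record = dict(zip(header, row))
        for col in required:
            if not record[col].strip():
                errors.append((line_no, f"empty value in '{col}'"))
    return errors


sample = "text,tag\nTime is money,metaphor\nbroken row\n,literal\n"
for line_no, error_type in check_csv(sample):
    print(f"line {line_no}: {error_type}")
```

Returning every error with its line number, rather than stopping at the first one, lets an uploader fix a rejected file in a single pass.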