pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks

Juan Manuel Pérez, Mariela Rajngewerc, Juan Carlos Giudici, Damián A. Furman, Franco Luque, Laura Alonso Alemany, María Vanina Martínez

TL;DR

The paper introduces pysentimiento, a multilingual Python toolkit for opinion mining and Social NLP tasks designed to democratize access to state-of-the-art models across Spanish, English, Italian, and Portuguese. It systematically evaluates a range of language-specific pretrained models on four tasks (sentiment, emotion, hate speech, irony), using careful preprocessing and fine-tuning with multiple seeds, and incorporates a fairness assessment using the Equity Evaluation Corpus. The authors demonstrate that specialized social-media models generally outperform general-domain baselines and compare pysentimiento favorably against other open-source tools, while releasing the best-performing models for community use. They also discuss limitations, fairness considerations, and future work to broaden language coverage and expand utilities beyond sentiment to other information-extraction tasks. The work has practical impact by providing an accessible, open-source framework that researchers can rapidly adopt for multilingual social-media analysis with built-in model selection, evaluation, and fairness considerations.

Abstract

In recent years, the extraction of opinions and information from user-generated text has attracted a lot of interest, largely due to the unprecedented volume of content in Social Media. However, social researchers face obstacles in adopting cutting-edge tools for these tasks, as such tools are usually behind commercial APIs, unavailable for languages other than English, or too complex for non-experts to use. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library packages state-of-the-art models for Spanish, English, Italian, and Portuguese behind an easy-to-use interface, allowing researchers to leverage these techniques. We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.

Paper Structure

This paper contains 18 sections, 2 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: Overview of the process presented in this paper. We selected datasets for each task and language pair, pre-processed them into the format expected for training, fine-tuned several underlying models on these datasets, and compared their performance on a common benchmark. The best-performing models were integrated into the final release of the tool; they are deployed on the Hugging Face Hub and can be used directly through the library.
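
The comparison and selection steps in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes macro-averaged F1 as the comparison metric and the mean over seeds as the selection rule, neither of which is stated on this page. The function names, model names, and scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed metric, see lead-in)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def select_best_model(runs):
    """Pick the model with the highest mean score across seeds.

    `runs` maps a model name to a list of scores, one per random seed.
    """
    averaged = {name: mean(scores) for name, scores in runs.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


# Toy predictions for one fine-tuning run (illustrative only)
y_true = ["POS", "NEG", "NEU", "POS"]
y_pred = ["POS", "NEG", "POS", "POS"]
score = macro_f1(y_true, y_pred)

# Hypothetical per-seed scores for two candidate models
runs = {
    "general_model": [0.60, 0.62, 0.61],
    "social_media_model": [0.66, 0.65, 0.67],
}
best, averaged = select_best_model(runs)
```

Averaging over several seeds before comparing reduces the chance that a model is selected because of one lucky initialization.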