A Legal Framework for Natural Language Processing Model Training in Portugal

Rúben Almeida; Evelin Amorim

TL;DR

The paper addresses the lack of Portuguese-specific guidance for NLP development amid GDPR and copyright concerns associated with large language models. It proposes the first Portuguese NLP legal framework by mapping Portuguese and EU legislation to typical NLP workflows, providing licensing guidance and three use-case scenarios. Key contributions include a licensing survey of models and datasets, flowchart-guided use cases, and a bridge between computer science and law. The framework aims to reduce legal risk and accelerate compliant NLP research in Portugal within the EU regulatory landscape, including upcoming AI regulation.

Abstract

Recent advances in deep learning have enabled many computational systems capable of performing intelligent actions that, until recently, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Alongside these advances, concerns have emerged about the copyright and data privacy infringements these applications may cause. Despite these concerns, the pace at which new natural language processing applications are developed has far outpaced the introduction of new regulations. Today, communication barriers between legal experts and computer scientists lead to many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team intends to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases, while highlighting the Portuguese legislation that may apply during their development.

Paper Structure (18 sections, 3 figures, 1 table)

Introduction
Related Work
Portuguese
European
Portuguese NLP: Quick Overview
Leveraging Brazilian Portuguese Resources
Portuguese Legal System
Scientific Exceptions
National Legislation
European Legislation
NLP Licensing System
Use Cases
Load Brazilian Portuguese Dataset From HuggingFace
Crawl Portuguese Websites to Produce a Large NLP Corpus
Use Tweets to Produce Political Profiles: The Facebook-Cambridge Analytica case
...and 3 more sections

Figures (3)

Figure 1: Flowchart summarizing the legal questions associated with loading a non-EU dataset.
Figure 2: Flowchart highlighting the legal considerations that web crawling may raise.
Figure 3: Flowchart concerning the legal issues of processing sensitive data.
