A Legal Framework for Natural Language Processing Model Training in Portugal
Rúben Almeida, Evelin Amorim
TL;DR
The paper addresses the lack of Portuguese-specific guidance for NLP development amid GDPR and copyright concerns associated with large language models. It proposes the first Portuguese NLP legal framework by mapping Portuguese and EU legislation to typical NLP workflows, providing licensing guidance and three use-case scenarios. Key contributions include a licensing survey of models and datasets, flowchart-guided use cases, and a bridge between computer science and law. The framework aims to reduce legal risk and accelerate compliant NLP research in Portugal within the EU regulatory landscape, including upcoming AI regulation.
Abstract
Recent advances in deep learning have enabled many computational systems to perform intelligent actions that were previously restricted to the human intellect. In the particular case of human languages, these advances have enabled applications such as ChatGPT, which can generate coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. These advances have been accompanied by concerns about copyright and data privacy infringements caused by such applications. Despite these concerns, the development of new natural language processing applications has largely outpaced the introduction of new regulations. Today, communication barriers between legal experts and computer scientists lead to many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team aims to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases and highlighting the Portuguese legislation that may apply during their development.
