Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis

Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hüppi, Silvia Ciampelli, Akhil Ratan Misra, Nils Lang, Wolfram Hinzen, Iris Sommer, Philipp Homan

TL;DR

The paper tackles the lack of standardisation and reproducibility in quantitative linguistic analysis by introducing LPDS, a data schema inspired by the Brain Imaging Data Structure (BIDS), and pelican_nlp, a modular, YAML-configured processing package. LPDS standardises data storage and naming, while pelican_nlp runs an end-to-end pipeline that preprocesses LPDS-formatted data and extracts linguistic and acoustic features from it. The contributions include concrete LPDS specifications, a configurable open-source processing framework, and provenance-enabled workflows that support cross-site and longitudinal studies, aligning with FAIR principles. By enabling transparent, reusable, and interoperable analyses across disciplines and languages, the framework aims to maximise reproducibility and comparability in linguistic research.

Abstract

The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican_nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican_nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican_nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.

Paper Structure

This paper contains 22 sections, 2 figures.

Figures (2)

  • Figure 1: Framework of the pelican_nlp package. The framework shows processing details from command line interface to linguistic feature output. Green boxes represent core processing files. Blue boxes correspond to the main components of the package. Yellow boxes correspond to files related to data preprocessing. Red boxes correspond to files related to linguistic feature extraction. Grey boxes correspond to package utility files.
  • Figure 3: Workflow diagram illustrating how to use the pelican_nlp package. Orange boxes highlight the steps in which user input is required. The original dataset (a) is transformed into Language Processing Data Structure (LPDS) format (b). The transformed dataset (c) and the created or chosen configuration file (d) form the pipeline input (e). Running the terminal command pelican-run on the pipeline input (f) executes the pelican_nlp package (g), which then calculates linguistic metrics (h) and stores them in CSV format.
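
The abstract describes LPDS as a BIDS-inspired folder structure and file naming convention, but this summary does not reproduce the LPDS rules themselves. The sketch below therefore illustrates only the general BIDS-style pattern of key-value entities (for example sub-01, ses-01) that such a scheme builds on. The folder levels, the entity names, the suffix, and the helper functions bids_style_path and parse_entities are all assumptions made for illustration; they are not the LPDS specification or part of pelican_nlp.

```python
from pathlib import Path


def bids_style_path(root, subject, session, task, suffix, ext):
    """Build a BIDS-style path: one folder per entity, key-value filename.

    The layout and entity names are illustrative assumptions,
    not the LPDS specification.
    """
    sub = f"sub-{subject}"
    ses = f"ses-{session}"
    filename = f"{sub}_{ses}_task-{task}_{suffix}.{ext}"
    return Path(root) / sub / ses / task / filename


def parse_entities(filename):
    """Recover the key-value entities and the trailing suffix from a filename."""
    stem = Path(filename).stem
    # Everything before the last underscore-separated token is a key-value pair.
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    entities["suffix"] = suffix
    return entities


# Hypothetical project name and labels, chosen only for the example.
path = bids_style_path("my_project", "01", "01", "interview", "transcript", "txt")
print(path.as_posix())
print(parse_entities(path.name))
```

Encoding subject, session, and task in both the folder path and the filename keeps each file self-describing when it is copied out of its folder, which is the property that makes cross-site and longitudinal datasets easy to merge.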