NLP Workbench: Efficient and Extensible Integration of State-of-the-art Text Mining Tools

Peiran Yao; Matej Kosmajac; Abeer Waheed; Kostyantyn Guzhva; Natalie Hervieux; Denilson Barbosa

TL;DR

NLP Workbench addresses the barrier non-experts face when applying state-of-the-art NLP to large corpora by providing a web-based, extensible platform that unifies corpus management, text mining tools, and visualization. It relies on a containerized, microservice architecture with DAG-based pipelining and a distributed execution model, enabling efficient, parallelized computation and reuse of intermediate results via Elasticsearch indexing. Core contributions include modular, replaceable components for NER, coreference, entity linking, relation extraction, semantic parsing, summarization, sentiment analysis, and social network analysis, all accessible through REST/RPC interfaces and a browser extension. The platform supports diverse use cases in Digital Humanities, Business Analytics, and NLP Research, and is released under the MIT license to promote reproducibility and collaboration. Overall, NLP Workbench offers a scalable, user-friendly, and extensible framework that brings cutting-edge text-mining models to non-experts while enabling advanced researchers to integrate new tools and pipelines easily.

Abstract

NLP Workbench is a web-based platform for text mining that allows non-expert users to obtain a semantic understanding of large-scale corpora using state-of-the-art text mining models. The platform is built upon the latest pre-trained models and open-source systems from academia that provide semantic analysis functionality, including but not limited to entity linking, sentiment analysis, semantic parsing, and relation extraction. Its extensible design enables researchers and developers to replace an existing model or integrate a new one with little effort. To improve efficiency, we employ a microservice architecture that facilitates the allocation of acceleration hardware and the parallelization of computation. This paper presents the architecture of NLP Workbench and discusses the challenges we faced in designing it. We also discuss diverse use cases of NLP Workbench and the benefits of using it over other approaches. The platform is under active development, with its source code released under the MIT license. A website and a short video demonstrating our platform are also available.

Paper Structure (31 sections, 3 figures)

This paper contains 31 sections, 3 figures.

Introduction
Platform
Interaction
Architecture
Interface
Related Work
Architecture
Workflow
User Perspective
System Perspective
Pipelining and Scheduling
Containerized Microservices
Components
Named Entity Recognition
Coreference Resolution
...and 16 more sections

Figures (3)

Figure 1: Workflow of NLP Workbench from the perspectives of the user and the system, as described in the Workflow section. For document-level visualization, we showcase the user interface for named entity recognition, coreference resolution, entity linking, and semantic parsing. For corpus-level visualization, this figure includes plots of a social network constructed from a tweet of the official Nobel Prize account, and the distribution of sentiment polarity scores of a sample of the built-in news corpus. Icons created by Freepik - Flaticon.
Figure 2: Microservice architecture of NLP Workbench. Each rectangle represents a physical machine, with its capability indicated by the icon at the bottom right corner. Each rounded rectangle represents a container, with the tool and function it provides indicated by the text inside. Container to physical machine allocation is for illustration purposes only and is adjusted to fit the need when the system is deployed in production.
Figure 3: Example of a batched task with the directed acyclic graphs of dependencies. Shaded nodes represent tools that the user requests to run on the document, and unshaded nodes represent tools that are needed to provide the inputs to the shaded nodes.
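
The dependency resolution described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the platform's actual scheduler: the `DEPENDENCIES` table, the tool names in it, and the `plan` function are hypothetical and chosen only for illustration. Given the tools a user requests (the shaded nodes), the sketch collects the prerequisite tools (the unshaded nodes), skips any tool whose output is assumed to be already stored, and returns an order in which every tool runs after its inputs are available.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency table: tool -> tools whose output it consumes.
# The real platform's dependencies may differ.
DEPENDENCIES = {
    "ner": set(),
    "coreference": set(),
    "sentiment": set(),
    "entity_linking": {"ner"},
    "relation_extraction": {"ner", "coreference"},
}


def plan(requested, dependencies, cached=frozenset()):
    """Return an execution order for the requested tools and their prerequisites.

    Tools listed in `cached` are assumed to have stored results already,
    so neither they nor their own prerequisites are scheduled on their behalf.
    """
    needed = set()
    stack = list(requested)
    while stack:
        tool = stack.pop()
        if tool in needed or tool in cached:
            continue
        needed.add(tool)
        stack.extend(dependencies[tool])

    # Keep only edges between tools that still have to run.
    graph = {tool: dependencies[tool] & needed for tool in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Requesting relation extraction pulls in NER and coreference resolution.
    print(plan({"relation_extraction"}, DEPENDENCIES))
    # With NER results already stored, only the remaining tools are scheduled.
    print(plan({"relation_extraction"}, DEPENDENCIES, cached={"ner"}))
```

Tools with no ordering constraint between them (here `ner` and `coreference`) may appear in either order, which is also what makes them candidates for parallel execution.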
