CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models

Zaid Sheikh, Antonios Anastasopoulos, Shruti Rijhwani, Lindia Tjuatja, Robbie Jimerson, Graham Neubig

TL;DR

CMULAB is an open-source, web-based framework that lowers barriers to applying state-of-the-art NLP to under-resourced languages by combining pre-trained multilingual models, a user-friendly interface, and human-in-the-loop fine-tuning. Its modular Django backend and plugin-based model registry, coupled with Redis-backed task queues and Docker-based scaling, enable rapid adaptation for tasks such as OCR post-correction, phoneme recognition, speaker diarization, machine translation (MT), and interlinear glossing, all accessible through REST APIs and ELAN integration. A case study on Seneca demonstrates gains in OCR accuracy through iterative training. The work aims to democratize NLP tooling for language communities, with planned enhancements including active learning, evaluation tools, version history, and finer-grained access controls.

Abstract

Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This can present a significant obstacle for language community members and linguists who want to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe the tools and APIs that are currently available and how developers can easily add new models and functionality to the framework. Code is available at https://github.com/neulab/cmulab along with a live demo at https://cmulab.dev.
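
To make the idea of a plugin-based model registry concrete, the following is a minimal sketch in plain Python. It is not CMULAB's actual API: the names `MODEL_REGISTRY`, `register_model`, `run_model`, and the `uppercase_demo` stand-in model are all hypothetical, chosen only to illustrate how a framework might let developers add a new model by registering a single function.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a model name to a callable that processes text.
# This is an illustrative assumption, not CMULAB's real data structure.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {}


def register_model(name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Return a decorator that stores the decorated function under `name`."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        if name in MODEL_REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        MODEL_REGISTRY[name] = func
        return func
    return decorator


@register_model("uppercase_demo")
def uppercase_demo(text: str) -> str:
    # Stand-in for a real model (e.g. OCR post-correction); it only uppercases.
    return text.upper()


def run_model(name: str, text: str) -> str:
    """Look up a registered model by name and apply it to the input text."""
    if name not in MODEL_REGISTRY:
        raise KeyError(f"unknown model {name!r}")
    return MODEL_REGISTRY[name](text)
```

Under these assumptions, a REST endpoint or task-queue worker would only need to call `run_model` with the name supplied in a request, so adding a model would not require changes to the dispatching code.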

Paper Structure

This paper contains 14 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Architecture Diagram
  • Figure 2: CMULAB Homepage
  • Figure 3: CMULAB Models page
  • Figure 4: Machine Translation UI
  • Figure 5: CMULAB OCR post-correction tool