Natural Language Processing (almost) from Scratch

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

TL;DR

The paper introduces SENNA, a unified neural network framework that learns representations from unlabeled data and performs multiple NLP tagging tasks (POS, CHUNK, NER, SRL) with minimal task-specific feature engineering. It compares window-based and sentence-based (TDNN/convolutional) architectures trained with word-level and sentence-level likelihoods, and shows that pretraining on unlabeled data substantially improves performance on the supervised tasks. It also explores multi-task learning and modest task-specific engineering empirically, showing that larger unlabeled corpora and simple, fast models can approach state-of-the-art results while remaining practical in speed and memory. The result is a compact, fast tagging system that combines language-model-derived embeddings with selective engineering.

Abstract

We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

Paper Structure

This paper contains 55 sections, 48 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Window approach network.
  • Figure 2: Sentence approach network.
  • Figure 3: Number of features chosen at each word position by the Max layer. We consider a sentence approach network (Figure 2) trained for SRL. The number of "local" features output by the convolution layer is $300$ per word. By applying a Max over the sentence, we obtain $300$ features for the whole sentence. It is interesting to see that the network catches features mostly around the verb of interest (here "report") and word of interest ("proposed" (left) or "often" (right)).
  • Figure 4: F1 score on the validation set (y-axis) versus number of hidden units (x-axis) for different tasks trained with the sentence-level likelihood (SLL), as in Table \ref{tbl-res-nn}. For SRL, we vary in this graph only the number of hidden units in the second layer. The scale is adapted for each task. We show the standard deviation (obtained over 5 runs with different random initializations), for the architecture we picked (300 hidden units for POS, CHUNK and NER, 500 for SRL).
  • Figure 5: Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with the window approach architecture presented in Figure 1. Lookup tables as well as the first hidden layer are shared. The last layer is task specific. The principle is the same with more than two tasks.
  • ...and 1 more figure
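
The Max layer described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the convolution layer has already produced one feature vector per word; random values stand in for those outputs here, and the function names `max_over_time` and `selection_counts` are hypothetical, not taken from the paper or from SENNA. For each feature, the sketch takes the maximum over all word positions and records which position supplied it, which is the quantity Figure 3 counts.

```python
import random


def max_over_time(features):
    """Max over the sentence dimension.

    features: list of n_words rows, each a list of n_feats floats
              (a stand-in for the convolution layer's per-word outputs).
    Returns (pooled, winners): the max value of each feature over the
    sentence, and the word position that supplied that max.
    """
    n_words = len(features)
    n_feats = len(features[0])
    pooled, winners = [], []
    for j in range(n_feats):
        # Position of the largest value of feature j (first one on ties).
        best_t = max(range(n_words), key=lambda t: features[t][j])
        winners.append(best_t)
        pooled.append(features[best_t][j])
    return pooled, winners


def selection_counts(winners, n_words):
    """Count how many features were chosen at each word position."""
    counts = [0] * n_words
    for t in winners:
        counts[t] += 1
    return counts


# Assumed toy setup: an 8-word sentence with 300 local features per word.
rng = random.Random(0)
n_words, n_feats = 8, 300
features = [[rng.gauss(0.0, 1.0) for _ in range(n_feats)] for _ in range(n_words)]

pooled, winners = max_over_time(features)
counts = selection_counts(winners, n_words)
print(counts)  # one count per word position; the counts sum to 300
```

With random inputs the counts are spread roughly evenly across positions. The caption reports that in a trained SRL network they concentrate around the verb and the word of interest.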

Theorems & Definitions (9)

  • Remark 1: Border Effects
  • Remark 2
  • Remark 3: Graph Transformer Networks
  • Remark 4: Conditional Random Fields
  • Remark 5: Differentiability
  • Remark 6: Modular Approach
  • Remark 7: Tricks
  • Remark 8: Architectures
  • Remark 9: Training Time