Natural Language Processing (almost) from Scratch
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
TL;DR
The paper introduces SENNA, a unified neural network framework that learns representations from unlabeled data and applies them to four NLP tagging tasks (POS tagging, chunking, NER, and SRL) with minimal task-specific feature engineering. It compares window-based and sentence-based (TDNN/convolutional) architectures, trained with word-level and sentence-level likelihoods, and shows that pretraining on unlabeled data substantially improves performance on the supervised tasks. It also explores multi-task learning and a modest amount of task-specific engineering, finding that larger unlabeled corpora combined with simple, fast models can approach state-of-the-art results. The resulting system is a compact tagger, practical in both speed and memory, built on embeddings derived from a language model.
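To make the window-based architecture mentioned above more concrete, here is a minimal sketch of a forward pass: each word in a fixed-size window is mapped to an embedding, the embeddings are concatenated, and the result goes through a linear layer, a HardTanh nonlinearity, and a second linear layer that yields one score per tag. This is an illustration under stated assumptions, not the SENNA implementation: the class name `WindowTagger`, the dimensions, the `<pad>` and `<unk>` tokens, and the random untrained weights are all hypothetical choices made for the example.

```python
import random


def hardtanh(x):
    """Clip x to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows (untrained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, vec):
    """Affine map: weights @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


class WindowTagger:
    """Hypothetical window-approach tagger; all sizes are illustrative."""

    def __init__(self, vocab, n_tags, emb_dim=4, window=3, hidden=8, seed=0):
        assert window % 2 == 1, "window size must be odd to center on a word"
        rng = random.Random(seed)
        # Assumed special tokens for sentence borders and unseen words.
        self.index = {w: i for i, w in enumerate(["<pad>", "<unk>"] + sorted(vocab))}
        self.window = window
        self.emb = make_matrix(len(self.index), emb_dim, rng)  # lookup table
        self.w1 = make_matrix(hidden, emb_dim * window, rng)
        self.b1 = [0.0] * hidden
        self.w2 = make_matrix(n_tags, hidden, rng)
        self.b2 = [0.0] * n_tags

    def scores(self, sentence):
        """Return one list of tag scores per word in the sentence."""
        half = self.window // 2
        pad = self.index["<pad>"]
        ids = [self.index.get(w, self.index["<unk>"]) for w in sentence]
        padded = [pad] * half + ids + [pad] * half
        out = []
        for t in range(len(sentence)):
            # Concatenate the embeddings of the words in the window around t.
            features = [x for i in padded[t:t + self.window] for x in self.emb[i]]
            hidden = [hardtanh(h) for h in linear(self.w1, self.b1, features)]
            out.append(linear(self.w2, self.b2, hidden))
        return out


tagger = WindowTagger(vocab={"the", "cat", "sat"}, n_tags=3)
sentence_scores = tagger.scores(["the", "cat", "sat", "down"])
```

In this sketch the predicted tag for each word would be the index of its highest score. Because the weights are random, the scores carry no meaning until the model is trained; the point is only the data flow from word window to per-tag scores.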
Abstract
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
