Bridging the Gap for Tokenizer-Free Language Models
Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, Noah Constant
TL;DR
Tokenizer-free, character-level language models with sufficient capacity can match word-based models on a large-scale dataset. A vanilla transformer with 40 self-attention layers, trained on the One Billion Word (lm1b) benchmark, sets a new state of the art for tokenizer-free LMs.
Abstract
Purely character-based language models (LMs) have been lagging in quality on large-scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the model is essential to achieving competitive results. In this paper, we show that, contrary to this conventional wisdom, tokenizer-free LMs with sufficient capacity can achieve competitive performance on a large-scale dataset. We train a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieve a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
