Explicit Word Density Estimation for Language Modelling
Jovan Andonov, Octavian Ganea, Paulina Grnarova, Gary Bécigneul, Thomas Hofmann
TL;DR
The paper tackles the limited expressiveness of Softmax-based language models by reframing word prediction as explicit word-density estimation with Neural Ordinary Differential Equations (Neural ODEs) and Continuous Normalizing Flows (CNFs). Integrating ODENets and CNFs enables context-dependent, non-linear transformations of the logits, which addresses both the Softmax Bottleneck and the Single Transformation limitation. The study introduces several LM architectures, among them Neural ODE logit transformations and CNF-based LMs with context-conditioned variants, and evaluates them on standard benchmarks, showing selective improvements over strong baselines such as AWD-LSTM, MoS, and DoC, especially under careful training regimes. The work highlights significant computational trade-offs but demonstrates that continuous-depth and flow-based approaches can yield richer output densities, potentially advancing practical language modelling with explicit density estimation.
Abstract
Language modelling has been a central part of Natural Language Processing for a very long time, and in the past few years LSTM-based language models have been the go-to method for commercial language modelling. Recently, it has been shown that, when language modelling is viewed from a matrix factorization point of view, the final Softmax layer limits the expressiveness of the model by putting an upper bound on the rank of the resulting matrix. Additionally, a new family of neural networks called NeuralODEs has been introduced as a continuous alternative to Residual Networks. Moreover, it has been shown that there is a connection between these models and Normalizing Flows. In this work we propose a new family of language models based on NeuralODEs and the continuous analogue of Normalizing Flows, and we manage to improve on some of the baselines.
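The rank bound mentioned in the abstract can be illustrated numerically. The sketch below is not taken from the paper: the function name `softmax_rank_demo`, the matrix sizes, and the random data are all assumptions chosen for illustration. It builds a logit matrix as the product of context vectors and word embeddings of dimension `hidden_dim`, applies a row-wise log-Softmax, and reports the numerical rank of both matrices. The logit matrix has rank at most `hidden_dim`, and subtracting a per-row normalizer can raise the rank by at most one.

```python
import numpy as np


def softmax_rank_demo(num_contexts=50, vocab_size=40, hidden_dim=5, seed=0):
    """Return (rank of logits, rank of log-probabilities) for random data.

    All sizes and the random inputs are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # One context vector per row, one word embedding per row.
    contexts = rng.standard_normal((num_contexts, hidden_dim))
    embeddings = rng.standard_normal((vocab_size, hidden_dim))

    # Logits factor through hidden_dim, so rank(logits) <= hidden_dim.
    logits = contexts @ embeddings.T

    # Numerically stable row-wise log-sum-exp.
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))

    # Subtracting a per-row constant adds at most one to the rank.
    log_probs = logits - log_norm

    return np.linalg.matrix_rank(logits), np.linalg.matrix_rank(log_probs)


logit_rank, log_prob_rank = softmax_rank_demo()
print("rank of logit matrix:", logit_rank)
print("rank of log-probability matrix:", log_prob_rank)
```

With a 50 x 40 matrix and `hidden_dim=5`, the reported ranks stay far below the maximum possible rank of 40, which is the limitation that context-dependent, non-linear logit transformations are meant to relax.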
