Explicit Word Density Estimation for Language Modelling

Jovan Andonov, Octavian Ganea, Paulina Grnarova, Gary Bécigneul, Thomas Hofmann

TL;DR

The paper tackles the limited expressiveness of Softmax-based language models by reframing word prediction as explicit word-density estimation through Neural Ordinary Differential Equations (Neural ODEs) and Continuous Normalizing Flows (CNFs). By integrating ODENets and CNFs, it enables context-dependent, non-linear transformations of logits, addressing both the Softmax Bottleneck and Single Transformation limitations. The study introduces several LM architectures, including NeuralODE logit transformations and CNF-based LMs (including context-conditioned variants), and evaluates them on standard benchmarks, showing selective improvements over strong baselines like AWD-LSTM, MoS, and DoC, especially under careful training regimes. The work highlights significant computational trade-offs but demonstrates that continuous-depth and flow-based approaches can yield richer output densities, potentially advancing practical language modelling with explicit density estimation.

Abstract

Language modelling has long been a central part of Natural Language Processing, and in the past few years LSTM-based language models have been the go-to method for commercial language modelling. Recently, it has been shown that, when language modelling is viewed from a matrix factorization point of view, the final Softmax layer limits the expressiveness of the model by putting an upper bound on the rank of the resulting matrix. Additionally, a new family of neural networks called NeuralODEs has been introduced as a continuous alternative to Residual Networks. Moreover, it has been shown that there is a connection between these models and Normalizing Flows. In this work we propose a new family of language models based on NeuralODEs and the continuous analogue of Normalizing Flows, and manage to improve on some of the baselines.
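
The rank bound mentioned in the abstract can be checked numerically. The following is a minimal sketch, not code from the paper: the sizes, the random context vectors `H`, and the random output embeddings `W` are illustrative assumptions. It shows that a logit matrix built from a `hidden_dim`-dimensional Softmax layer has rank at most `hidden_dim`, and that the corresponding log-probability matrix has rank at most `hidden_dim + 1`, regardless of the vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
num_contexts, vocab_size, hidden_dim = 200, 50, 8

# Hypothetical context vectors (one row per context) and output word embeddings.
H = rng.standard_normal((num_contexts, hidden_dim))
W = rng.standard_normal((vocab_size, hidden_dim))

# Logits for every (context, word) pair: a product of two rank-<=hidden_dim factors.
logits = H @ W.T
logit_rank = np.linalg.matrix_rank(logits)

# Log-softmax subtracts one normalising constant per row, which is a rank-1
# correction, so the log-probability matrix has rank at most hidden_dim + 1.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
log_prob_rank = np.linalg.matrix_rank(log_probs)

print("rank of logits:", logit_rank, "(bound:", hidden_dim, ")")
print("rank of log-probabilities:", log_prob_rank, "(bound:", hidden_dim + 1, ")")
```

If the true log-probability matrix of the language has a rank higher than this bound, a single Softmax layer of this width cannot represent it exactly. This is the limitation the abstract refers to.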

Paper Structure

This paper contains 40 sections, 82 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: Neighborhood of car in the embedding space of AWD-LSTM.
  • Figure 2: Neighborhood of computer in the embedding space of AWD-LSTM.
  • Figure 3: Left: A Residual network defines a discrete sequence of finite transformations. Right: An ODE network defines a vector field that continuously transforms the state (Chen et al., 2018).
  • Figure 4: General-purpose ODENet. $f(z(t), t; \theta)$ is the neural network specifying the differential equation, $z(t_0)$ is the initial state, $t_0$ is the initial time, $t_1$ is the final time, $z(t_1)$ is the final state, and $L$ is a scalar-valued loss function.
  • Figure 5: Reverse-mode differentiation of an ODE solution. The adjoint sensitivity method solves an augmented ODE backwards in time. The red lines denote the sequence of separate ODE solves (Chen et al., 2018).
  • ...and 3 more figures
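
The quantities named in the captions of Figures 3 and 4 can be illustrated with a small sketch. It is not the paper's implementation: the one-layer tanh dynamics `f`, the parameter shapes, the squared-norm loss, and the fixed-step Euler integrator `odeint_euler` are all assumptions chosen for brevity. Practical NeuralODE code typically uses an adaptive solver and the adjoint method from Figure 5, neither of which is shown here. The sketch integrates $dz/dt = f(z(t), t; \theta)$ from $t_0$ to $t_1$, and each Euler update has the same form as a residual block, which is the correspondence drawn in Figure 3.

```python
import numpy as np


def f(z, t, theta):
    """Hypothetical dynamics f(z(t), t; theta): a one-layer tanh network.

    The time argument t is accepted for generality but unused in this sketch.
    """
    W, b = theta
    return np.tanh(W @ z + b)


def odeint_euler(f, z0, t0, t1, theta, num_steps=100):
    """Approximate z(t1) from z(t0) = z0 with fixed-step explicit Euler."""
    z = np.array(z0, dtype=float)
    h = (t1 - t0) / num_steps
    t = t0
    for _ in range(num_steps):
        # Same form as a residual block: z_{k+1} = z_k + h * f(z_k).
        z = z + h * f(z, t, theta)
        t += h
    return z


rng = np.random.default_rng(0)
dim = 4  # illustrative state size (assumption)
theta = (0.5 * rng.standard_normal((dim, dim)), np.zeros(dim))

z0 = rng.standard_normal(dim)               # initial state z(t0)
z1 = odeint_euler(f, z0, 0.0, 1.0, theta)   # final state z(t1)
loss = float(np.sum(z1 ** 2))               # arbitrary scalar-valued loss L(z(t1))

print("z(t1) =", z1)
print("L =", loss)
```

With `num_steps=1` the integrator reduces to a single residual update `z0 + f(z0)`, and increasing `num_steps` moves the computation towards the continuous-depth limit.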