Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU

Brielen Madureira, David Schlangen

TL;DR

This work investigates how bidirectional encoders behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems.

Abstract

While humans process language incrementally, the best language encoders currently used in NLP do not. Both bidirectional LSTMs and Transformers assume that the sequence that is to be encoded is available in full, to be processed either forwards and backwards (BiLSTMs) or as a whole (Transformers). We investigate how they behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems. We test five models on various NLU datasets and compare their performance using three incremental evaluation metrics. The results support the possibility of using bidirectional encoders in incremental mode while retaining most of their non-incremental quality. The "omni-directional" BERT model, which achieves better non-incremental performance, is impacted more by the incremental access. This can be alleviated by adapting the training regime (truncated training), or the testing procedure, by delaying the output until some right context is available or by incorporating hypothetical right contexts generated by a language model like GPT-2.
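The incremental interface described above can be illustrated with a minimal sketch: a non-incremental model is simply re-run on every prefix of the input, and each new output is compared with the previous one. Everything in the code is an assumption made for illustration, not the paper's implementation: `toy_tagger` is a hypothetical stand-in for a bidirectional tagger (its labels depend on the right neighbour, so earlier labels can change as more input arrives), and `incremental_outputs` and `count_edits` are hypothetical helper names.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for a bidirectional tagger.

    A token is labelled "V" if the next token is "the", otherwise "N".
    The dependence on the right neighbour mimics right-context sensitivity.
    """
    labels = []
    for i in range(len(tokens)):
        next_is_the = i + 1 < len(tokens) and tokens[i + 1] == "the"
        labels.append("V" if next_is_the else "N")
    return labels


def incremental_outputs(tagger: Tagger, tokens: List[str]) -> List[List[str]]:
    """Re-run the tagger on each prefix: one partial output per time step."""
    return [tagger(tokens[:t]) for t in range(1, len(tokens) + 1)]


def count_edits(outputs: List[List[str]]) -> Tuple[int, int]:
    """Count additions (new labels) and substitutions (revised labels)."""
    additions = substitutions = 0
    previous: List[str] = []
    for current in outputs:
        # Labels for newly observed tokens are additions.
        additions += len(current) - len(previous)
        # Labels that differ from the previous time step are substitutions.
        substitutions += sum(old != new for old, new in zip(previous, current))
        previous = current
    return additions, substitutions


sentence = ["book", "the", "flight"]
outputs = incremental_outputs(toy_tagger, sentence)
additions, substitutions = count_edits(outputs)
print(outputs)
print(additions, substitutions)
```

In this toy run the label of "book" is revised once the second word is observed, which is the kind of substitution an incremental evaluation has to account for. The output at the final time step equals the output of the tagger on the full sentence.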

Paper Structure

This paper contains 13 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Incremental interface on a bidirectional tagging model (here for chunking). Each line represents the input and output at a time step. Necessary additions are green/bold, substitutions are yellow/underlined, and the dashed frame shows the output of the final time step, which is the same as the non-incremental model's.
  • Figure 2: How we estimate the evaluation metrics for the complete sequence of outputs from Figure 1.
  • Figure 3: Models for sequence tagging, w=word and l=label. (a) is the only inherently incremental. (a), (b) and (e) can also be used for sequence classification if we consider only their final representation.
  • Figure 4: Incremental interface of a non-incremental bidirectional model, showing the input and output at time step 3. The context vector fed into the backward LSTM can be zero or initialized with a hypothetical right context generated by a language model.
  • Figure 5: Example of the calculation of Edit Overhead with $\Delta1$ delay for the example in Figure 2. The first choice for each label happens once the subsequent word has been observed, except for the last token in the sentence.
  • ...and 4 more figures