An Empirical Study on Context Length for Open-Domain Dialog Generation

Xinyi Shen, Zuoquan Lin

TL;DR

This paper investigates how the length of dialog history used as context affects Transformer-based open-domain dialog models. It compares training from scratch and GPT-2 fine-tuning on DailyDialog and PersonaChat, varying training and testing context lengths and evaluating with perplexity. Key findings show that longer context is not always better, the overall best model generalizes across history lengths, and selecting an optimal context length per sample at test time can yield notable improvements, though practical estimation remains an open challenge. The work provides guidance for balancing context length with computational cost and points toward adaptive context-length strategies for real-world dialog systems.

Abstract

Transformer-based open-domain dialog models have become increasingly popular in recent years. These models typically represent context as a concatenation of the dialog history. However, there is no established criterion for deciding how many utterances should be kept in the context. We investigate how the choice of context length affects the model. We experiment on three questions, from coarse to fine: (i) Does longer context help model training? (ii) Is it necessary to change the training context length when dealing with dialogs of different context lengths? (iii) Do different dialog samples have the same preference for context length? Our experimental results show that context length, an often overlooked setting, deserves attention when implementing Transformer-based dialog models.
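
To make the notion of context length concrete, the following is a minimal sketch of how a context might be built by concatenating only the most recent utterances of a dialog history. The function name `build_context`, the parameter `max_turns`, and the separator string `" <eou> "` are illustrative assumptions, not details taken from the paper.

```python
from typing import List


def build_context(history: List[str], max_turns: int, sep: str = " <eou> ") -> str:
    """Concatenate the most recent `max_turns` utterances of a dialog history.

    `sep` is a hypothetical end-of-utterance separator; a real model would
    use whatever special token its tokenizer defines.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    # Negative slicing keeps the last `max_turns` utterances, or the whole
    # history when it is shorter than that.
    return sep.join(history[-max_turns:])


if __name__ == "__main__":
    history = ["Hi there.", "Hello, how are you?", "Fine, thanks. And you?"]
    for k in (1, 2, 5):
        print(k, "->", build_context(history, k))
```

Under this sketch, varying `max_turns` at training time or at test time corresponds to the two kinds of context-length settings the study compares.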

Paper Structure (7 sections, 2 equations, 2 figures, 2 tables)

This paper contains 7 sections, 2 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Perplexity of models trained under different context length settings on the DailyDialog (left) and PersonaChat (right) test sets. The x-axis represents the maximum number of dialog turns allowed in the context when training the model. An 'x' means that the perplexity gain at this context length is less than $0.1$.
  • Figure 2: The proportion of test samples that achieve optimal perplexity under different test context lengths. We present results for $\mathcal{D}_2$, $\mathcal{D}_5$, and $\mathcal{D}_{\ge 10} (= \bigcup_{i \ge 10} \mathcal{D}_i)$ as representatives of samples with short, medium, and long context. We use the Transformer and GPT-2 models trained with context length 10 as the respective test models.
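
The statistic plotted in Figure 2 can be illustrated with a small sketch. It assumes that each test sample already has a perplexity value for every candidate test context length, and it counts how often each length is the best one. The function names, the input layout (one dictionary per sample), and the tie-breaking rule (prefer the shorter context) are assumptions made for illustration and are not specified in the source. Perplexity is computed in the standard way, as the exponential of the mean per-token negative log-likelihood.

```python
import math
from collections import Counter
from typing import Dict, List


def perplexity(token_nlls: List[float]) -> float:
    """Standard perplexity: exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def optimal_length_proportions(ppl_by_length: List[Dict[int, float]]) -> Dict[int, float]:
    """For each sample, pick the context length with the lowest perplexity,
    then return the fraction of samples won by each length.

    Ties are broken in favour of the shorter context (an assumption).
    """
    winners = Counter(
        min(sample, key=lambda k: (sample[k], k)) for sample in ppl_by_length
    )
    total = len(ppl_by_length)
    return {k: winners[k] / total for k in sorted(winners)}


if __name__ == "__main__":
    # Made-up perplexities for four samples at test context lengths 1 and 2.
    samples = [
        {1: 30.0, 2: 25.0},
        {1: 20.0, 2: 22.0},
        {1: 18.0, 2: 18.0},
        {1: 40.0, 2: 35.0},
    ]
    print(optimal_length_proportions(samples))
```

The numbers in the usage example are invented purely to exercise the function and carry no relation to the paper's results.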