Natural Language Generation
Emiel van Miltenburg, Chenghua Lin
TL;DR
Natural Language Generation (NLG) deals with the automatic generation of text from structured or unstructured sources. The article surveys classical pipeline approaches, end-to-end and LLM-based methods, and evaluation practices, highlighting trade-offs between fluency and factuality and the challenges of long-form coherence and reproducibility. It discusses applications in business, journalism, and medicine, and emphasizes ethical and social considerations such as dual-use and model openness. The work offers a structured synthesis of methods, evaluation practices, and open questions to guide responsible development and benchmarking of NLG systems.
Abstract
This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since MT involves no content selection: the system does not have to determine what to say. Conversely, dialog systems do not typically fall under the heading of Natural Language Generation, since NLG is just one component of a dialog system (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.
