Innovations in Neural Data-to-text Generation: A Survey
Mandar Sharma, Ajay Gogineni, Naren Ramakrishnan
TL;DR
This survey analyzes neural data-to-text generation (DTG), outlining how modern models convert structured data (meaning representations, graphs, tables) into natural language while balancing fidelity, coherence, and stylistic variation. It covers data representations, preprocessing, seq2seq and non-seq2seq innovations, and evolving evaluation practices, emphasizing the roles of delexicalization, linearization, graph encoders, plan-based generation, and fine-tuning of pretrained language models (PLMs). Key contributions include a taxonomy of architectural and training strategies (entity, hierarchical, and graph encoders; reconstruction; regularization; reinforcement learning; templates), unsupervised pretraining tailored to DTG, and non-end-to-end approaches that improve interpretability and domain transfer. The paper also highlights reproducibility and fairness, advocates living benchmarks and domain-diverse datasets, and discusses future directions at the intersection of DTG with large language models, numerical reasoning, and external tools for computation and validation.
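To make two of the preprocessing terms above concrete, the following is a minimal sketch of linearization (flattening slot-value pairs into one input string) and delexicalization (replacing specific values with placeholders). The function names, the `<slot>` tag format, the `__SLOT__` placeholder format, and the restaurant-style example record are all assumptions chosen for illustration; they are not taken from the survey or from any particular system.

```python
def linearize(mr):
    """Flatten a slot-value record into one string a seq2seq model can consume."""
    return " ".join(f"<{slot}> {value}" for slot, value in mr.items())


def delexicalize(mr, text, slots):
    """Replace the values of the chosen slots with placeholders in the record and the text."""
    delex_mr = dict(mr)
    for slot in slots:
        placeholder = f"__{slot.upper()}__"  # hypothetical placeholder format
        text = text.replace(mr[slot], placeholder)
        delex_mr[slot] = placeholder
    return delex_mr, text


def relexicalize(text, mr, slots):
    """Restore the original values in a generated (delexicalized) text."""
    for slot in slots:
        text = text.replace(f"__{slot.upper()}__", mr[slot])
    return text


# Hypothetical meaning representation and reference text.
mr = {"name": "Aromi", "eatType": "coffee shop", "area": "city centre"}
reference = "Aromi is a coffee shop in the city centre."

delex_mr, delex_text = delexicalize(mr, reference, ["name"])
model_input = linearize(delex_mr)
restored = relexicalize(delex_text, mr, ["name"])
```

In this sketch a model would be trained on pairs of `model_input` and `delex_text`, and `relexicalize` would be applied to its output. Plain string replacement is the simplest possible matching strategy; it would miss paraphrased or inflected values, which is one reason delexicalization is usually restricted to slots whose values appear verbatim in the text.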
Abstract
The neural boom that has driven natural language processing (NLP) research over the last decade has similarly led to significant innovations in data-to-text generation (DTG). This survey offers a consolidated view of the neural DTG paradigm through a structured examination of its approaches, benchmark datasets, and evaluation protocols. It draws the boundaries that separate DTG from the rest of the natural language generation (NLG) landscape, provides an up-to-date synthesis of the literature, and highlights the stages of technological adoption from within and outside the greater NLG umbrella. With this holistic view, we highlight promising avenues for DTG research that focus not only on the design of linguistically capable systems but also on systems that exhibit fairness and accountability.
