Self-training from Self-memory in Data-to-text Generation

Hoang-Thang Ta

TL;DR

Self-training from Self-memory (STSM) addresses data-to-text generation (DTG) by training a data-to-text (D2T) model on subsets that combine self-memory, i.e. outputs inferred from the trained models, with newly added data. A greedy optimization step shortens the D2T outputs while preserving all source values, and a text-to-data (T2D) validator filters out outputs that cannot be converted back to the source data. With only 30% of the data, the D2T model reaches performance competitive with full training. On DART and E2E NLG, the best results come from combining self-memory with new data, while self-training the T2D model offers limited gains. The results suggest a data-efficient route to continual learning in DTG, and the authors make their curated dataset publicly available for replication and further study.

Abstract

This paper introduces a novel training model, self-training from self-memory (STSM), for data-to-text generation (DTG). It allows the model to self-train on subsets that include self-memory, i.e. outputs inferred directly from the trained models, and/or new data. The quality of self-memory is validated by two models, data-to-text (D2T) and text-to-data (T2D), against two predefined conditions: (1) all source values appear in the outputs of the D2T model, and (2) the T2D model can convert those outputs back into the source data. We use a greedy algorithm to generate shorter D2T outputs that still contain all source values. We then use the T2D model to confirm that these outputs capture the input relationships, by showing that the text can be converted back into data. With 30% of the dataset, we can train the D2T model to a performance competitive with full training in the same setup. We evaluate our model on two datasets, E2E NLG and DART. STSM gives the D2T model a generalization capability from its subset memory while reducing the volume of training data. Ultimately, we anticipate that this paper will contribute to continual learning solutions that adapt to new training data, incorporating it as a form of self-memory in DTG tasks. The curated dataset is publicly available at: https://github.com/hoangthangta/STSM.
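
The two validation conditions and the greedy shortening step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sentence-level dropping strategy, the case-insensitive substring match, and the `t2d` callable are all assumptions made for illustration. The abstract only states that shorter outputs are kept if they contain all source values and can be converted back to the source data.

```python
import re

def contains_all_values(text, values):
    """Condition (1): every source value appears in the text (case-insensitive match, an assumption)."""
    lowered = text.lower()
    return all(v.lower() in lowered for v in values)

def greedy_shorten(text, values):
    """Greedily drop sentences as long as the remaining text still contains all source values.

    Sentence-level dropping is a hypothetical strategy; the paper's algorithm may differ.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    i = 0
    while i < len(sentences):
        candidate = sentences[:i] + sentences[i + 1:]
        if candidate and contains_all_values(" ".join(candidate), values):
            sentences = candidate  # sentence i was redundant; keep the shorter text
        else:
            i += 1  # sentence i is needed; move on to the next one
    return " ".join(sentences)

def round_trip_ok(text, source_triples, t2d):
    """Condition (2): a T2D callable (hypothetical) maps the text back to the source triples."""
    return set(t2d(text)) == set(source_triples)

values = ["The Eagle", "coffee shop", "riverside"]
output = ("The Eagle is a coffee shop. It is located in the riverside area. "
          "The Eagle is a coffee shop by the riverside.")
shorter = greedy_shorten(output, values)
```

In this toy example the first two sentences are redundant, so `shorter` keeps only the last sentence, which alone mentions all three values. A real T2D model would replace the `t2d` callable passed to `round_trip_ok`.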

Paper Structure (21 sections, 5 equations, 2 figures, 4 tables, 1 algorithm)

Figures (2)

  • Figure 1: The self-training model. The steps are: draw a fixed/random subset for the first training of the D2T and T2D models; draw a fixed/random subset to start self-training the D2T and T2D models; infer $Y'$ and $X'$; optimize $Y'$ into $Y''$ and use it to infer $X''$; create new data, ($X_N, Y_N$) and ($Y_N, X_N$); self-train the D2T and T2D models on the new data. The T2D self-training is optional.
  • Figure 2: The data allocation for different training methods. The figure is best viewed in color.
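
The data-creation step in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: `d2t`, `t2d`, and `optimize` are hypothetical callables standing in for the trained models and the greedy optimization, and the exact-match round-trip filter is an assumption about how outputs are accepted.

```python
def build_self_memory(sources, d2t, t2d, optimize):
    """Build new (X_N, Y_N) and (Y_N, X_N) pairs from validated model outputs.

    sources: iterable of inputs, each a tuple of (subject, relation, value) triples.
    d2t, t2d, optimize: hypothetical callables, not the paper's actual API.
    """
    d2t_pairs, t2d_pairs = [], []
    for x in sources:
        y_prime = d2t(x)                  # infer Y' from the source data
        y_opt = optimize(y_prime, x)      # optimize Y' into Y''
        x_back = t2d(y_opt)               # infer X'' from Y''
        if set(x_back) == set(x):         # keep only outputs that convert back to the source
            d2t_pairs.append((x, y_opt))  # new data for D2T self-training
            t2d_pairs.append((y_opt, x))  # new data for the optional T2D self-training
    return d2t_pairs, t2d_pairs

# Toy stand-ins for the models, for illustration only.
sources = [(("Eagle", "food", "French"),), (("Mill", "area", "centre"),)]
toy_d2t = lambda x: f"{x[0][0]} {x[0][1]} {x[0][2]}."
toy_optimize = lambda y, x: y
toy_t2d = lambda y: [tuple(y.rstrip(".").split())] if y.startswith("Eagle") else []

d2t_pairs, t2d_pairs = build_self_memory(sources, toy_d2t, toy_t2d, toy_optimize)
```

Here the toy T2D stub recovers only the first source, so only that pair enters the self-memory; the second is filtered out.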