Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang, Bei Peng, Danushka Bollegala

Abstract

Conversational agents are required to respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also with responses that consider multiple plausible alternative scenarios, reflecting diversity. Despite the growing need to train diverse commonsense generators, progress in this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators and cover only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly improve both generation diversity and quality compared with vanilla models and with models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.

Paper Structure

This paper contains 37 sections, 3 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Quality--Diversity trade-off for representative models. The x-axis represents the semantic diversity (Self-CosSim, $\uparrow$) and the y-axis represents the generation quality (Overall, $\uparrow$). While vanilla models (hollow markers) often suffer from either low quality or limited diversity, their counterparts fine-tuned on our synthetic data, CommonSyn (solid markers), consistently push the performance frontier towards the top-right quadrant, achieving a Pareto improvement across diverse model families.
  • Figure 2: Trade-off between Quality and Diversity. The dashed line indicates the Pareto frontier. CommonSyn (Red Star) strictly outperforms the gradient-hybrid method Q&G-based (Brown Cross) by achieving higher quality at the same diversity level.
  • Figure 3: Evaluation prompts used by GPT-4o to judge model generations in pairwise comparison. Each prompt defines task-specific criteria for selecting the better output between the model output and the human reference.
  • Figure 4: Prompt template used to expand 2-seed concept sets during synthetic data generation. This instruction guides an LLM to add contextually relevant keywords that enable the construction of a single coherent, everyday scenario.
  • Figure 5: Prompts used to generate synthetic sentences from expanded concept sets. The top prompt is shared by both dynamic few-shot ($D_{\text{dyn}}$) and multi-sentence few-shot ($D_{\text{ms}}$) strategies; the bottom prompt enforces explicit reasoning via chain-of-thought ($D_{\text{cot}}$). All outputs are constrained to $\leq 22$ words and filtered for full keyword coverage.
  • ...and 2 more figures
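
The Figure 5 caption states that generated sentences are constrained to at most 22 words and filtered for full keyword coverage. A minimal sketch of such a filter follows. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, words are counted by whitespace splitting, and keyword matching uses a crude prefix test as a stand-in for whatever lemma-level matching the paper actually uses.

```python
import re

MAX_WORDS = 22  # length cap stated in the Figure 5 caption


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def covers_all_keywords(sentence: str, keywords: list[str]) -> bool:
    """Return True if every keyword is matched by at least one token.

    Assumption: a token matches a keyword when it starts with that keyword
    (e.g. "catches" matches "catch"). This is a rough approximation of
    lemma matching and will miss irregular forms such as "ran" for "run".
    """
    tokens = tokenize(sentence)
    return all(
        any(tok.startswith(kw.lower()) for tok in tokens) for kw in keywords
    )


def keep_sentence(
    sentence: str, keywords: list[str], max_words: int = MAX_WORDS
) -> bool:
    """Keep a sentence only if it is short enough and covers every keyword."""
    n_words = len(sentence.split())  # assumption: whitespace word count
    return n_words <= max_words and covers_all_keywords(sentence, keywords)
```

For example, `keep_sentence("A dog catches a frisbee in the park.", ["dog", "frisbee", "catch"])` passes both checks, while a sentence that omits any concept or runs past 22 words is discarded.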