Unnatural Language Processing: Bridging the Gap Between Synthetic and Natural Language Data

Alana Marzoev, Samuel Madden, M. Frans Kaashoek, Michael Cafarella, Jacob Andreas

TL;DR

This work tackles data scarcity in NLP with a simulation-to-real framework: a handcrafted synthetic grammar generates large-scale labeled data, a model is trained on the synthetic utterances, and embedding-based projection maps natural language inputs onto the synthetic language before they are interpreted. The authors introduce several techniques (paraphrase search, SimHash-based amortized inference, hierarchical projection, and tunable matching scores) to make projection scalable and accurate across complex grammars. Empirical results on semantic parsing (Overnight) and instruction following (BabyAI) show that training on synthetic data with projection matches or outperforms models trained on natural language data in several domains. The work demonstrates a practical path for building grounded NLP systems with reduced annotation effort, and provides a software-and-data release to encourage further progress in sim-to-real transfer.

Abstract

Large, human-annotated datasets are central to the development of natural language processing models. Collecting these datasets can be the most challenging part of the development process. We address this problem by introducing a general purpose technique for "simulation-to-real" transfer in language understanding problems with a delimited set of target behaviors, making it possible to develop models that can interpret natural utterances without natural training data. We begin with a synthetic data generation procedure, and train a model that can accurately interpret utterances produced by the data generator. To generalize to natural utterances, we automatically find projections of natural language utterances onto the support of the synthetic language, using learned sentence embeddings to define a distance metric. With only synthetic training data, our approach matches or outperforms state-of-the-art models trained on natural language data in several domains. These results suggest that simulation-to-real transfer is a practical framework for developing NLP applications, and that improved models for transfer might provide wide-ranging improvements in downstream tasks.
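
The projection step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the bag-of-words `embed` function stands in for learned sentence embeddings, the `SYNTHETIC` table of utterances and logical forms is hypothetical, and a dictionary lookup stands in for the model trained on synthetic data.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a learned sentence encoder: bag-of-words counts."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def project(natural, synthetic_sentences):
    """Return the synthetic sentence nearest to `natural` in embedding space."""
    query = embed(natural)
    return max(synthetic_sentences, key=lambda s: cosine(query, embed(s)))

# Hypothetical synthetic utterances paired with hypothetical logical forms.
SYNTHETIC = {
    "meeting whose start time is 10am": "filter(meeting, start_time, 10am)",
    "meeting whose location is central office": "filter(meeting, location, central_office)",
    "person that is attendee of weekly standup": "attendee_of(weekly_standup)",
}

def interpret(natural):
    """Project onto the synthetic language, then interpret the projection."""
    synth = project(natural, list(SYNTHETIC))
    # A lookup stands in for the model trained on synthetic utterances.
    return synth, SYNTHETIC[synth]

synth, logical_form = interpret("which meeting has a start time of 10am")
print(synth)
print(logical_form)
```

Enumerating every synthetic sentence, as this sketch does, only works for small languages. The TL;DR lists amortized inference and hierarchical projection as the techniques intended for larger grammars.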

Paper Structure

This paper contains 20 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of our approach to synthetic-to-real transfer in language understanding tasks. (a) Training examples are generated from a synthetic data generation procedure that covers desired model behaviors but a limited range of linguistic variation. (b) This data is used to train a model that can correctly interpret synthetic utterances. (c) Separately, sentence representations are learned using a masked language modeling scheme like BERT. (d) To interpret human-generated model inputs from a broader distribution, we first project onto the set of sentences reachable by the synthetic data generation procedure, and then interpret the projected sentence with the trained model.
  • Figure 2: Hierarchical projection. Given a natural language input, we search for a high-scoring utterance generated by a fixed CFG, using similarity between sentences and partial derivations as a search heuristic. An additional heuristic scores noun phrases locally by measuring their similarity with noun chunks extracted from the input sentence.
  • Figure 3: Example sentences from the calendar and recipe Overnight domains. real is a human-generated utterance, synth is a synthetic utterance from the domain grammar of calendar, and LF is the target logical expression.
  • Figure 4: Example from the BabyAI dataset. Agents are given tasks with language-like specifications, and must execute a sequence of low-level actions in the environment to complete them. We augment this dataset with a set of human instructions, and evaluate generalization of agents trained only on synthetic goal specifications to novel human requests.
  • ...and 3 more figures
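
The hierarchical projection summarized in the Figure 2 caption can be sketched as a beam search over partial derivations of a CFG, scored by their similarity to the input. This is a sketch under stated assumptions rather than the paper's procedure: the toy `GRAMMAR` is hypothetical (loosely styled after BabyAI instructions), bag-of-words cosine similarity stands in for learned sentence embeddings, and the local noun-phrase heuristic mentioned in the caption is omitted.

```python
import math
from collections import Counter

# Hypothetical toy grammar; keys are nonterminals, all other symbols are terminals.
GRAMMAR = {
    "S": [["go", "to", "NP"], ["pick", "up", "NP"]],
    "NP": [["the", "COLOR", "OBJ"]],
    "COLOR": [["red"], ["blue"], ["green"]],
    "OBJ": [["ball"], ["box"], ["key"]],
}

def similarity(tokens_a, tokens_b):
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def terminals(derivation):
    """Terminal symbols produced so far by a partial derivation."""
    return [sym for sym in derivation if sym not in GRAMMAR]

def expand(derivation):
    """Rewrite the leftmost nonterminal with each of its productions."""
    for i, sym in enumerate(derivation):
        if sym in GRAMMAR:
            return [derivation[:i] + rhs + derivation[i + 1:]
                    for rhs in GRAMMAR[sym]]
    return []  # derivation is already fully terminal

def hierarchical_project(natural, beam_width=2):
    """Beam search for a grammar sentence similar to `natural`."""
    query = natural.lower().split()
    beam = [["S"]]
    while any(sym in GRAMMAR for d in beam for sym in d):
        candidates = []
        for d in beam:
            # Keep finished derivations in the pool; expand unfinished ones.
            candidates.extend(expand(d) or [d])
        candidates.sort(key=lambda d: similarity(terminals(d), query),
                        reverse=True)
        beam = candidates[:beam_width]
    return " ".join(beam[0])

print(hierarchical_project("could you pick up that blue key"))
```

Scoring partial derivations lets the search discard whole subtrees of the grammar early, which is the reason for preferring this over enumerating every complete sentence. The beam width trades search cost against the risk of pruning the best sentence.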