Neural paraphrasing by automatically crawled and aligned sentence pairs

Achille Globo, Antonio Trevisi, Andrea Zugarini, Leonardo Rigutini, Marco Maggini, Stefano Melacci

TL;DR

The paper tackles paraphrase generation by addressing the lack of large, aligned paraphrase datasets. It introduces an automatic dataset construction pipeline that crawls Italian news and blogs, annotates content with rich linguistic features, and uses Highly Constrained Sentence Similarity Search to extract (input, paraphrase) pairs. A neural paraphrasing model based on Pointer networks with a copy mechanism is trained on the generated data, showing improvements over a plain sequence-to-sequence baseline in ROUGE metrics and highlighting the feasibility of scalable paraphrase data creation. The work demonstrates a practical approach for Italian and outlines avenues to extend to English and incorporate knowledge-driven constraints.

Abstract

Paraphrasing is the task of rewriting an input text using other words, without altering the meaning of the original content. Conversational systems can exploit automatic paraphrasing to make the conversation more natural, e.g., by talking about a certain topic using different paraphrases at different times. Recently, the task of automatically generating paraphrases has been approached in the context of Natural Language Generation (NLG). While many existing systems simply consist of rule-based models, the recent success of Deep Neural Networks in several NLG tasks naturally suggests the possibility of exploiting such networks for generating paraphrases. However, the main obstacle to neural-network-based paraphrasing is the lack of large datasets of aligned sentence-paraphrase pairs, which are needed to efficiently train the neural models. In this paper we present a method for the automatic generation of large aligned corpora, based on the assumption that news and blog websites talk about the same events using different narrative styles. We propose a similarity search procedure with linguistic constraints that, given a reference sentence, is able to locate the most similar candidate paraphrases among millions of indexed sentences. The data generation process is evaluated on the Italian language, with experiments using pointer-based deep neural architectures.
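
To make the idea of a constrained similarity search concrete, here is a minimal sketch in Python. It is not the authors' procedure: the abstract does not specify the linguistic constraints or the similarity measure, so the `Sentence` type, the "same named entities" constraint, the Jaccard token overlap, and the `min_sim`/`max_sim` thresholds are all assumptions chosen for illustration. The sentences are invented English examples, whereas the paper works on Italian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentence:
    text: str
    entities: frozenset  # hypothetical annotation: named entities in the sentence


def tokens(text):
    """Lowercased set of words, with surrounding punctuation stripped crudely."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Jaccard overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_paraphrase_candidates(reference, index, min_sim=0.3, max_sim=0.9, top_k=3):
    """Return up to top_k (similarity, sentence) pairs, most similar first.

    Assumed constraints, for illustration only:
      * the candidate must mention exactly the same named entities;
      * similarity must be high enough to suggest the same event, but below
        max_sim so that near-copies of the reference are discarded.
    """
    ref_tokens = tokens(reference.text)
    scored = []
    for cand in index:
        if cand.entities != reference.entities:
            continue  # linguistic constraint not satisfied
        sim = jaccard(ref_tokens, tokens(cand.text))
        if min_sim <= sim <= max_sim:
            scored.append((sim, cand))
    # Sort on the score only, so Sentence objects are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


reference = Sentence("The mayor of Rome opened the new metro line on Monday.",
                     frozenset({"Rome"}))
index = [
    Sentence("On Monday the mayor of Rome inaugurated the new metro line.",
             frozenset({"Rome"})),
    Sentence("The mayor of Milan opened the new metro line on Monday.",
             frozenset({"Milan"})),   # rejected: different entity
    Sentence("The mayor of Rome opened the new metro line on Monday.",
             frozenset({"Rome"})),    # rejected: identical to the reference
    Sentence("Rome will host the summit next year.",
             frozenset({"Rome"})),    # rejected: too dissimilar
]

results = find_paraphrase_candidates(reference, index)
for sim, cand in results:
    print(f"{sim:.2f}  {cand.text}")
```

In this toy index only the first candidate survives both filters. A linear scan like this would not scale to millions of indexed sentences; a real pipeline would presumably rely on an inverted index or a search engine, which this sketch does not attempt to model.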

Paper Structure

This paper contains 15 sections, 8 equations, 2 figures, 1 table, 2 algorithms.

Figures (2)

  • Figure 1: Examples of aligned pairs automatically generated with the proposed method (Italian).
  • Figure 2: Example of a common issue in the considered neural models. Generated sentences are sometimes too similar to the input sentence.