Linguistically-Controlled Paraphrase Generation

Mohamed Elgaar, Hadi Amiri

TL;DR

This work tackles controlled paraphrase generation by introducing LingConv, an encoder-decoder framework that supports fine-grained control over 40 linguistic attributes. It combines decoder-side attribute injection and MICE-based imputation with a linguistic attribute predictor, a semantic equivalence classifier, and a novel inference-time quality-control loop that iteratively refines outputs to match target attributes while preserving meaning. LingConv reduces attribute error by up to 34% relative to strong baselines, with quality control contributing an additional 14% improvement, and demonstrates practical utility for data augmentation in downstream tasks. The paper also introduces a Novel Target Challenge to test adaptability to unseen attribute combinations and discusses ethical considerations and potential extensions to multilingual settings.

Abstract

Controlled paraphrase generation produces paraphrases that preserve meaning while allowing precise control over linguistic attributes of the output. We introduce LingConv, an encoder-decoder framework that enables fine-grained control over 40 linguistic attributes in English. To improve reliability, we introduce a novel inference-time quality control mechanism that iteratively refines attribute embeddings to generate paraphrases that closely match target attributes without sacrificing semantic fidelity. LingConv reduces attribute error by up to 34% over existing models, with the quality control mechanism contributing an additional 14% improvement.

Paper Structure

This paper contains 42 sections, 4 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: We aim to transform a given sentence into multiple paraphrases, each satisfying distinct linguistic attributes. Our model takes a source sentence and a set of target linguistic attributes and generates a paraphrase optimized to satisfy the target attributes. Here we show three paraphrases with different linguistic attributes generated for the source sentence. Linguistic features are identified using the spaCy "en_core_web_sm" model, with the stop-word list from spacy_stopwords.
  • Figure 2: LingConv Architecture: The paraphrase generator extends the T5 model by incorporating linguistic attributes into the decoder inputs. Linguistic attributes of the source ($\mathbf{l}^s$) and target ($\mathbf{l}^t$) are embedded and fused into generation by element-wise addition to the decoder inputs. In addition, the linguistic attribute predictor estimates the attributes of the generated text, which allows the linguistic attribute error to be backpropagated. During inference, the quality control mechanism iteratively adjusts inputs to guide outputs towards the desired attributes. The Semantic Equivalence Predictor (SE) receives the source sentence and the candidate generation $\hat{t}$ as input, as in Algorithm 1 (line 25), to assess semantic similarity. The model is trained with a dual objective of semantic equivalence and linguistic attribute adherence.
  • Figure 3: Qualitative comparison of LingConv and QCPG. For each attribute, the target, each model's value, and the error magnitude are shown. Large errors are bolded.
  • Figure 4: Attribute distributions for effective vs. ineffective augmentation on the RTE (Limited) dataset. Effective augmentation has a greater percentage of shorter sentences.
  • Figure 5: For CoLA (Limited), effective augmentation is associated with an increased percentage of sentences whose ratio of unique verbs is $> 0.7$.
  • ...and 4 more figures
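
The fusion step described in the Figure 2 caption (attribute vectors are embedded, then added element-wise to the decoder inputs) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the embedding width `D_MODEL`, the random linear maps standing in for learned embedding layers, the use of separate maps for source and target, and the choice to add the embeddings at every decoder position are all assumptions made for illustration. The helper names (`make_projection`, `embed_attributes`, `fuse`) are hypothetical.

```python
import random

NUM_ATTRS = 40  # number of linguistic attributes, per the abstract
D_MODEL = 8     # hypothetical embedding width, kept tiny for illustration


def make_projection(num_attrs, d_model, seed):
    """Random linear map standing in for a learned attribute embedding layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
            for _ in range(num_attrs)]


def embed_attributes(attrs, projection):
    """Project an attribute vector (length num_attrs) to a d_model-sized embedding."""
    d_model = len(projection[0])
    return [sum(a * row[j] for a, row in zip(attrs, projection))
            for j in range(d_model)]


def fuse(decoder_inputs, src_emb, tgt_emb):
    """Add source and target attribute embeddings to each decoder input position.

    Adding at every position is an assumption of this sketch.
    """
    return [[x + s + t for x, s, t in zip(position, src_emb, tgt_emb)]
            for position in decoder_inputs]


rng = random.Random(0)
src_attrs = [rng.random() for _ in range(NUM_ATTRS)]  # stands in for l^s
tgt_attrs = [rng.random() for _ in range(NUM_ATTRS)]  # stands in for l^t
# Five decoder positions, each a d_model-sized input embedding.
decoder_inputs = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(5)]

# Separate maps for source and target are an assumption of this sketch.
src_emb = embed_attributes(src_attrs, make_projection(NUM_ATTRS, D_MODEL, seed=1))
tgt_emb = embed_attributes(tgt_attrs, make_projection(NUM_ATTRS, D_MODEL, seed=2))

fused = fuse(decoder_inputs, src_emb, tgt_emb)
print(len(fused), len(fused[0]))
```

Because the fusion is additive, the decoder input shape is unchanged, so under these assumptions the attribute signal can be injected without altering the rest of the decoder.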