Intertwining CP and NLP: The Generation of Unreasonably Constrained Sentences

Alexandre Bonlarron; Jean-Charles Régin

TL;DR

Problem: constrained text generation under hard linguistic rules is difficult for both pure NLP and CP approaches. Approach: CPTextGen treats constrained generation as a CSP encoded with Multi-valued Decision Diagrams, integrates n-gram language modeling, and employs an LLM-based perplexity curation step, formalized around the final MDD $MDD_{final}$. Contributions: a generic CP-based framework, an explicit MDD compilation strategy, and a RADNER-like case study showing the generation of highly constrained sentences, with analysis of LLM-based evaluation under unusual constraints. Significance: demonstrates a practical pathway to generate unreasonably constrained text and suggests new benchmarks for LLM robustness via perplexity-based ranking.

Abstract

Constrained text generation remains a challenging task, particularly when dealing with hard constraints. Traditional NLP approaches prioritize generating meaningful and coherent output, and current state-of-the-art methods often lack the expressiveness and constraint satisfaction capabilities needed to handle such tasks effectively. Recently, an approach for generating constrained sentences in CP was proposed in (Bonlarron et al., 2023). This ad hoc model, designed to solve the sentence generation problem under MNREAD rules, nevertheless proved computationally and structurally unsuitable for other, more constrained problems. In this paper, a novel and more generic approach is introduced to tackle many of these previously intractable problems. It is illustrated here with the sentence generation problem under RADNER rules. More precisely, this paper presents the CPTextGen framework, which treats a constrained text generation problem as a discrete combinatorial optimization problem. The problem is solved by a constraint programming method that combines linguistic properties (e.g., n-grams or language level) with more classical constraints (e.g., the number of characters or syllables). Finally, a curation phase selects the best generated sentences according to their perplexity, computed with an LLM. The effectiveness of this approach is demonstrated by tackling a new, more heavily constrained text generation problem: the iconic RADNER sentences problem. This problem aims to generate sentences respecting a set of strict rules defined by their use in vision and clinical research. Thanks to our CP-based approach, many new strongly constrained sentences have been successfully generated. This highlights our approach's potential to handle unreasonably constrained text generation scenarios.
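The curation phase mentioned in the abstract ranks generated sentences by perplexity. The following is a minimal Python sketch of that ranking step only, assuming the standard definition of perplexity as the exponential of the negative mean token log-probability. The function names and the log-probabilities are hypothetical placeholders; in the paper the scores come from an LLM, which is not reproduced here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum of the N token log-probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_by_perplexity(candidates):
    """Sort (sentence, token_logprobs) pairs from lowest to highest perplexity."""
    return sorted(candidates, key=lambda pair: perplexity(pair[1]))


# Hypothetical scores standing in for the log-probabilities an LLM would assign.
candidates = [
    ("sentence B", [math.log(0.1), math.log(0.1)]),
    ("sentence A", [math.log(0.5), math.log(0.5)]),
]
ranked = rank_by_perplexity(candidates)
best_sentence = ranked[0][0]
```

Lower perplexity means the scoring model finds the sentence less surprising, so the first element of `ranked` is the candidate a curation step of this kind would keep first.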

Paper Structure (35 sections, 7 equations, 5 figures, 4 tables)

Introduction
Preliminaries
Multi-valued Decision Diagram
MDD of a Constraint
Cost-MDD
Constraint Satisfaction Problem
Language Model
N-gram Model
LLM
Perplexity
The CPTextGen Framework
Input Data: N-gram Corpora
Ngram reTRIEval
Constraint Programming Model
MDD Compilation
...and 20 more sections

Figures (5)

Figure 1: Example of an MDD representing the solution set of the constraint $x_1 + x_2 + x_3 \in [5,9]$. The domains of the variables are $D(x_1) = \{1,3,7\}$, $D(x_2) = \{0,2,4\}$, and $D(x_3) = \{2,3,4\}$. For example, $(7,0,2)$ belongs to the set of solutions defined by the MDD.
Figure 2: Example of an MDDtrie storing 3-grams (successions of 3 words): "The black dog"; "A red pot"... Any path from the root to tt is a valid n-gram. To find the successors of the n-gram "The white dog" (in red), that is, the potential following words, we start from the root and walk along the arcs that carry the labels of the last two arcs, i.e., "white" and "dog" (in blue). In that case, one outgoing arc from the node can be reached with "white cat". Thus, the successor of "The white dog" is "likes" (in green).
Figure 3: This figure summarizes the major steps of the CPTextGen framework. Once the filtered n-grams are gathered, the MDD data structure acts as a bridge between constraint satisfaction, which relies on CP techniques, and n-gram chaining, which takes the structure of the language into account.
Figure 4: Example of an English RADNER sentence incorporating relative clauses, with three lines and 14 words. A subset of the RADNER rules is highlighted. Each line contains between 27 and 29 characters; in purple, several words must have exactly one syllable. In red, the second word of the second line must contain 10 characters and three syllables.
Figure 5: Number of nodes in each layer of the MDD during solving. The x-axis is the layer number, and the y-axis is the number of nodes in the layer (also called the width of the layer). The y-axis uses a logarithmic scale. N.B.: Every node in the MDD except the root has at least one incoming arc.
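
The constraint of Figure 1 and the notion of layer width used in Figure 5 can be illustrated together with a small Python sketch. It builds a layered diagram for $x_1 + x_2 + x_3 \in [5,9]$ in which each node is identified by a partial sum, then removes the nodes that cannot reach the terminal node tt. This is an illustrative assumption about one simple way to compile such a constraint, not the compilation strategy of the paper. The diagram is not reduced (equivalent nodes are not merged), so its widths may differ from those of the MDD drawn in Figure 1. The names `build_sum_mdd` and `solutions` are hypothetical.

```python
def build_sum_mdd(domains, low, high):
    """Return a list of layers; each layer maps a node to its arcs {label: child}."""
    # Top-down pass: a node is identified by the partial sum that reaches it.
    layers = [{0: {}}]
    for depth, domain in enumerate(domains):
        last = depth == len(domains) - 1
        children = {}
        for partial_sum, arcs in layers[-1].items():
            for value in domain:
                total = partial_sum + value
                if last:
                    # Only sums inside [low, high] get an arc to the terminal node.
                    if low <= total <= high:
                        arcs[value] = "tt"
                        children["tt"] = {}
                else:
                    arcs[value] = total
                    children.setdefault(total, {})
        layers.append(children)
    # Bottom-up pass: drop arcs to deleted nodes, then nodes left without arcs.
    for depth in range(len(layers) - 2, -1, -1):
        alive = layers[depth + 1]
        for node in list(layers[depth]):
            arcs = {v: c for v, c in layers[depth][node].items() if c in alive}
            if arcs or depth == 0:
                layers[depth][node] = arcs
            else:
                del layers[depth][node]
    return layers


def solutions(layers):
    """Enumerate every root-to-tt path as a tuple of arc labels."""
    def walk(depth, node, prefix):
        if node == "tt":
            yield tuple(prefix)
            return
        for value, child in layers[depth][node].items():
            yield from walk(depth + 1, child, prefix + [value])
    yield from walk(0, 0, [])


# Domains taken from the caption of Figure 1.
domains = [[1, 3, 7], [0, 2, 4], [2, 3, 4]]
mdd = build_sum_mdd(domains, 5, 9)
widths = [len(layer) for layer in mdd]  # number of nodes per layer
all_solutions = list(solutions(mdd))
```

With these domains, every path from the root to tt spells out one satisfying assignment, including $(7,0,2)$ from the caption, and `widths` gives the number of nodes per layer of this unreduced diagram, which is the quantity plotted in Figure 5 for the much larger MDD of the paper.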
