Explicitly Representing Syntax Improves Sentence-to-layout Prediction of Unexpected Situations

Wolf Nuyts; Ruben Cartuyvels; Marie-Francine Moens

TL;DR

The paper tackles sentence-to-layout prediction under unexpected situations by comparing explicit versus implicit syntax representations. It introduces USCOCO, a benchmark of grammatically correct captions describing rare object interactions, and a novel contrastive structural loss that aligns constituency-tree positions with visual embeddings. Results show large generalization gains for models that explicitly encode syntax when trained with the structural loss, while implicit-syntax models struggle on USCOCO. The work advances robust, compositional generation by showing how preserving syntactic structure in downstream embeddings improves out-of-distribution reasoning and suggests broader applicability to other structured-generation tasks.

Abstract

Recognizing visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis, as it allows localized and controlled in-painting of the image. This comparative study shows that layouts can be predicted from language representations that implicitly or explicitly encode sentence syntax, provided the sentences mention entity-relationships similar to those seen during training. To test compositional understanding, we collect a test set of grammatically correct sentences and layouts describing compositions of entities and relations that are unlikely to have been seen during training. Performance on this test set drops substantially, showing that current models rely on correlations in the training data and have difficulty understanding the structure of the input sentences. We propose a novel structural loss function that better enforces the syntactic structure of the input sentence and show large performance gains in the task of 2D spatial layout prediction conditioned on text. The loss has the potential to be used in other generation tasks where a tree-like structure underlies the conditioning modality. Code, trained models and the USCOCO evaluation set are available via GitHub.

Paper Structure (51 sections, 7 equations, 6 figures, 4 tables)

Introduction
Related work
Methods
Task definition
Text encoders $t_\phi$
Explicitly embedding syntax in sentence representations
Baselines that are assumed to implicitly encode syntax
Layout predictors $p_\psi$
Models
Training
The PAR model
The SEQ model
A cross-entropy loss
A combination of regression losses
A cross-entropy loss
...and 36 more sections

Figures (6)

Figure 1: Samples from the USCOCO dataset.
Figure 2: Overview of text-to-layout prediction.
Figure 3: Recall on replaced objects $\text{Re}^\text{repl}$ in USCOCO vs. structural loss $\mathcal{L}_\text{struct}$ weight $\lambda_\text{1}$.
Figure 4: Examples of generated layouts where annotators chose the layout of TG + $\mathcal{L}_\text{struct}$ over the layout of $\texttt{GPT-2}_\text{Bllip}$ (first 2 examples) and vice versa (last example).
Figure 5: Human evaluation of generated layouts by $\texttt{GPT-2}_\text{Bllip}$ (+ $\mathcal{L}_\text{struct}$) and TG (+ $\mathcal{L}_\text{struct}$) on USCOCO. Annotators choose the best layout between 2 layouts (anonymized and order-randomized).
...and 1 more figure
