Spot the Error: Non-autoregressive Graphic Layout Generation with Wireframe Locator

Jieru Lin, Danqing Huang, Tiejun Zhao, Dechen Zhan, Chin-Yew Lin

TL;DR

This work analyzes autoregressive versus non-autoregressive layout generation and proposes a learning-based wireframe locator to improve iterative refinement. A wireframe-conditioned Transformer decoder predicts layout tokens in parallel, while a dedicated locator identifies erroneous tokens from a rendered wireframe input using a Hungarian-matching-based data construction strategy. Experiments on Rico and PubLayNet show the proposed method surpasses AR and prior NAR baselines, with notable gains in PixelFID and related metrics. Ablations reveal pixel-space representations capture spatial patterns more effectively, and cross-attention to the wireframe regions confirms the model’s spatial grounding; the locator and iterative refinement together yield faster convergence and more balanced layouts.

Abstract

Layout generation is a critical step in graphic design to achieve meaningful compositions of elements. Most previous works view it as a sequence generation problem by concatenating element attribute tokens (i.e., category, size, position). So far the autoregressive (AR) approach has achieved promising results, but it is still limited in global context modeling and suffers from error propagation, since it can only attend to the previously generated tokens. Recent non-autoregressive (NAR) attempts have shown competitive results, providing a wider context range and the flexibility to refine with iterative decoding. However, current works only use simple heuristics to recognize erroneous tokens for refinement, which is inaccurate. This paper first conducts an in-depth analysis to better understand the difference between the AR and NAR frameworks. Furthermore, based on our observation that pixel space is more sensitive in capturing spatial patterns of graphic layouts (e.g., overlap, alignment), we propose a learning-based locator that takes the wireframe image rendered from the generated layout sequence as input and detects erroneous tokens. We show that it serves as a complementary modality to the element sequence in object space and contributes greatly to the overall performance. Experiments on two public datasets show that our approach outperforms both AR and NAR baselines. Extensive studies further prove the effectiveness of different modules with interesting findings. Our code will be available at https://github.com/ffffatgoose/SpotError.
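To make the "wireframe image rendered from the generated layout sequence" concrete, the following is a minimal sketch of how a layout could be rasterized into box outlines. It is an illustration under stated assumptions, not the authors' renderer: the element format `(category, x, y, w, h)`, the integer grid, and the helper names `render_wireframe` and `crossing_pixels` are all hypothetical. The point it illustrates is the abstract's observation that spatial patterns such as overlap show up directly in pixel space.

```python
from typing import List, Tuple

# Hypothetical element format (an assumption, not the paper's encoding):
# (category, x, y, w, h) with integer coordinates on a pixel grid.
Element = Tuple[str, int, int, int, int]


def render_wireframe(elements: List[Element], width: int, height: int) -> List[List[int]]:
    """Rasterize box outlines; each pixel counts how many outlines pass through it."""
    canvas = [[0] * width for _ in range(height)]
    for _category, x, y, w, h in elements:
        # Clip the box to the canvas; skip boxes that fall entirely outside it.
        x0, y0 = max(x, 0), max(y, 0)
        x1, y1 = min(x + w, width) - 1, min(y + h, height) - 1
        if x1 < x0 or y1 < y0:
            continue
        # Collect border pixels in a set so corners are counted once per box.
        border = set()
        for cx in range(x0, x1 + 1):
            border.add((cx, y0))
            border.add((cx, y1))
        for cy in range(y0, y1 + 1):
            border.add((x0, cy))
            border.add((x1, cy))
        for cx, cy in border:
            canvas[cy][cx] += 1
    return canvas


def crossing_pixels(canvas: List[List[int]]) -> int:
    """Count pixels where at least two outlines coincide (a crude overlap cue)."""
    return sum(1 for row in canvas for value in row if value >= 2)


overlapping = [("text", 1, 1, 4, 3), ("image", 3, 2, 4, 3)]
disjoint = [("text", 1, 1, 4, 3), ("image", 5, 4, 2, 2)]
print(crossing_pixels(render_wireframe(overlapping, 8, 6)))
print(crossing_pixels(render_wireframe(disjoint, 8, 6)))
```

In the token sequence, the two overlapping boxes differ from the disjoint pair only by a few coordinate values, whereas in the rendered grid the overlap appears as coinciding outline pixels. A learned locator would consume such an image rather than this hand-written count.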

Paper Structure (25 sections, 9 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Illustration of (a) the AR approach and (b) the iterative NAR approach in layout generation. The AR approach generates one token at a time, conditioned on previously generated tokens. NAR generates all tokens simultaneously and uses a locator (usually heuristics) to detect erroneous tokens, which are masked and re-predicted in the next decoding iteration.
  • Figure 2: Overview of our pipeline. Our model consists of a decoder and a locator. In each mask-predict iteration, the locator detects the erroneous attribute tokens to be masked, and the decoder predicts the masked tokens in a non-autoregressive manner.
  • Figure 3: Comparison of AR (LayoutTransformer) and NAR (BLT) approaches using different input element orders (i.e., position, category). Smaller overlap degree indicates better performance.
  • Figure 4: Qualitative results on PubLayNet and Rico.
  • Figure 5: Decoder-wireframe cross-attention visualization.
  • ...and 1 more figures
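
The mask-predict iteration described in the captions of Figures 1 and 2 can be sketched as a short loop. This is a toy illustration under stated assumptions, not the paper's implementation: `mask_predict`, `toy_decoder`, `toy_locator`, and `TARGET` are hypothetical names, the decoder is a hand-written stand-in for the Transformer decoder, and the locator is an oracle stand-in for the learned wireframe locator.

```python
from typing import Callable, List, Optional

MASK = None  # placeholder for a masked token (hypothetical convention)
Tokens = List[Optional[int]]

# Toy "ground truth" layout tokens, used only by the stand-in models below.
TARGET = [3, 1, 4, 1, 5]


def mask_predict(
    length: int,
    decoder: Callable[[Tokens], Tokens],
    locator: Callable[[Tokens], List[int]],
    max_iters: int = 10,
) -> Tokens:
    """Start fully masked; alternate parallel decoding and error masking."""
    tokens: Tokens = [MASK] * length
    for step in range(max_iters):
        tokens = decoder(tokens)   # fill every masked position in parallel
        bad = locator(tokens)      # indices judged erroneous
        if not bad or step == max_iters - 1:
            break                  # converged, or out of iterations
        for i in bad:
            tokens[i] = MASK       # re-predict these in the next iteration
    return tokens


def toy_decoder(tokens: Tokens) -> Tokens:
    """Stand-in decoder: odd positions are only right once some context is visible."""
    has_context = any(t is not MASK for t in tokens)
    out: Tokens = []
    for i, t in enumerate(tokens):
        if t is not MASK:
            out.append(t)          # keep tokens that were not masked
        elif has_context or i % 2 == 0:
            out.append(TARGET[i])
        else:
            out.append(0)          # deliberately wrong first guess
    return out


def toy_locator(tokens: Tokens) -> List[int]:
    """Oracle stand-in; the paper's locator is learned and reads a wireframe image."""
    return [i for i, t in enumerate(tokens) if t != TARGET[i]]


print(mask_predict(len(TARGET), toy_decoder, toy_locator))
```

The loop always decodes before returning, so the result never contains masked positions even when the iteration budget runs out. Swapping `toy_locator` for a confidence-threshold heuristic or a learned model changes only which indices get masked, which is the component the paper focuses on.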