Smaller But Better: Unifying Layout Generation with Smaller Large Language Models

Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin

TL;DR

This work introduces LGGPT, a decoder-only LLM for unified layout generation across multiple tasks and domains. It adopts Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as a compact, uniform input/output template, and uses Interval Quantization Encoding (IQE) to preserve layout cues without placeholders. With a 1.5B-parameter GPT2-XL backbone instruction-tuned on heterogeneous data spanning four domains and five datasets, LGGPT achieves competitive or state-of-the-art results on isolated tasks while remaining more efficient than much larger models. The authors demonstrate the necessity of LLMs for unified layout generation, show that IQE and ALI substantially boost performance, and identify 1.5B parameters as a practical sweet spot between proficiency and computational cost. Collectively, the work offers a practical, scalable path toward task- and domain-generic layout generation, with potential for extension to broader layout types and text-conditioned scenarios.

Abstract

We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.

Paper Structure

This paper contains 32 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Examples of different types of layout.
  • Figure 2: Overall architecture of LGGPT, which mainly consists of Arbitrary Layout Instruction (ALI), Universal Layout Response (ULR), the Interval Quantization Encoding (IQE) strategy, and a unified LLM. ALI is utilized for instruction tuning on the LLM, which consolidates a designated prompt for layout type and random layout conditions through Arbitrary Layout Condition Sequence. IQE is proposed to compress ALI for a more condensed structure. ULR requires the LLM always to generate a complete, precise layout given arbitrary layout inputs.
  • Figure 3: The visualized demonstration of the Arbitrary Layout Condition Sequence, which is the key component of ALI that accounts for the "arbitrary" property. It accommodates arbitrary layout conditions by supporting the infinite combination of known, unknown, and noisy attributes, therefore covering all possible layout generation tasks.
  • Figure 4: The schematic of Interval Quantization Encoding. It applies positional encoding to the $x,y,w,h$ attributes of each layout element by adding independent interval values. This enables the model to distinguish them solely according to the numerical magnitudes. Then we can skip the unknown attributes and concatenate other attributes to compress the layout sequence, avoiding the usage of conventional placeholders and significantly increasing the valid information density of input instructions. We present an example comparison of using placeholder and IQE at the bottom.
  • Figure 5: A detailed demonstration of the evaluated tasks and the corresponding instructions. $x^{no}_i$ denotes the noisy attribute. The ground-truth response represents the complete layout.
  • ...and 5 more figures
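
The Figure 4 caption describes Interval Quantization Encoding only in words, so a small sketch may help make the idea concrete. The code below is an illustration under stated assumptions, not the authors' implementation: the bin count `NUM_BINS = 128`, the offsets (multiples of `NUM_BINS`), the normalized `[0, 1]` inputs, and all function names are hypothetical. It covers only the $x,y,w,h$ attributes named in the caption. Each attribute is quantized and shifted into its own disjoint interval, so unknown attributes can simply be skipped and the remaining values stay identifiable by magnitude alone.

```python
from typing import Dict, List, Optional, Union

# Assumed values for illustration; the source does not state them here.
NUM_BINS = 128
ATTRS = ("x", "y", "w", "h")
# Each attribute gets its own disjoint interval: x -> [0, 128), y -> [128, 256), ...
OFFSETS = {attr: i * NUM_BINS for i, attr in enumerate(ATTRS)}


def quantize(value: float) -> int:
    """Map a normalized coordinate in [0, 1] to a bin index in [0, NUM_BINS - 1]."""
    return min(int(value * NUM_BINS), NUM_BINS - 1)


def iqe_encode(element: Dict[str, Optional[float]]) -> List[int]:
    """Encode the known attributes of one element, skipping unknown (None) ones."""
    tokens = []
    for attr in ATTRS:
        value = element.get(attr)
        if value is None:
            continue  # no placeholder is emitted for an unknown attribute
        tokens.append(quantize(value) + OFFSETS[attr])
    return tokens


def iqe_decode(tokens: List[int]) -> Dict[str, int]:
    """Recover which attribute each token belongs to from its magnitude alone."""
    return {ATTRS[t // NUM_BINS]: t % NUM_BINS for t in tokens}


def placeholder_encode(
    element: Dict[str, Optional[float]], placeholder: str = "<unk>"
) -> List[Union[int, str]]:
    """Baseline for comparison: fixed-length sequence with explicit placeholders."""
    return [
        quantize(element[attr]) if element.get(attr) is not None else placeholder
        for attr in ATTRS
    ]


if __name__ == "__main__":
    element = {"x": 0.25, "y": None, "w": 0.5, "h": None}
    print(placeholder_encode(element))  # four slots, two of them placeholders
    print(iqe_encode(element))          # two tokens, one per known attribute
    print(iqe_decode(iqe_encode(element)))
```

In this sketch the placeholder baseline always emits four slots per element, while the interval-encoded sequence emits one token per known attribute, which is the density gain the caption refers to.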