LTSim: Layout Transportation-based Similarity Measure for Evaluating Layout Generation
Mayu Otani, Naoto Inoue, Kotaro Kikuchi, Riku Togashi
TL;DR
The paper addresses the evaluation of generated layouts by introducing LTSim, a layout similarity measure based on optimal transport. Because optimal transport permits flexible, cross-category matching of elements, LTSim can compare layouts that differ in element counts or categories. The measure extends to collection-level evaluation via LTSim-MMD, which avoids reliance on dataset-specific feature extractors. Existing measures (e.g., DocSim, MeanIoU, FID, Max.IoU) have limitations that LTSim addresses by recognizing many-to-many and cross-category alignments, and it distinguishes varying degrees of layout difference and generation quality more reliably. Empirical results on RICO and PubLayNet show that LTSim provides more reliable and interpretable comparisons across unconditional and label-conditioned generation tasks, while reducing the need for learned representations. The measure is applicable to both UI and document layouts.
Abstract
We introduce a layout similarity measure designed to evaluate the results of layout generation. While several similarity measures have been proposed in prior research, their behaviors have not been discussed comprehensively. Our research reveals that the majority of these measures are unable to handle various layout differences, primarily because they depend on strict element matching, that is, one-to-one matching of elements within the same category. To overcome this limitation, we propose a new similarity measure based on optimal transport, which allows more flexible matching of elements. This approach lets us quantify the similarity between any two layouts, even those sharing no element categories, making our measure applicable to a wide range of layout generation tasks. For tasks such as unconditional layout generation, where FID is commonly used, we also extend our measure to collection-level similarities between groups of layouts. Our empirical results suggest that the collection-level measure offers more reliable comparisons than existing ones such as FID and Max.IoU.
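To make the idea of transport-based matching concrete, the following is a minimal sketch and not the paper's implementation. It assumes each layout is a list of (label, x, y, w, h) tuples in normalized coordinates, uses a hypothetical ground cost (Euclidean box distance plus a fixed penalty for differing labels), gives every element uniform mass, and approximates the optimal transport plan with entropy-regularized Sinkhorn iterations rather than an exact solver. The names element_cost, sinkhorn_plan, and layout_similarity, the parameters label_penalty and eps, and the final exp(-cost) mapping are all illustrative assumptions.

```python
import math

def element_cost(p, q, label_penalty=1.0):
    """Hypothetical ground cost: box distance plus a penalty if labels differ."""
    box_dist = math.dist(p[1:], q[1:])  # Euclidean distance over (x, y, w, h)
    return box_dist + (0.0 if p[0] == q[0] else label_penalty)

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Approximate transport plan between uniform masses (entropic OT)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on elements of the first layout
    b = [1.0 / m] * m  # uniform mass on elements of the second layout
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def layout_similarity(layout_a, layout_b):
    """Map the transport cost to (0, 1]; exp(-cost) is an assumed choice."""
    cost = [[element_cost(p, q) for q in layout_b] for p in layout_a]
    plan = sinkhorn_plan(cost)
    total = sum(plan[i][j] * cost[i][j]
                for i in range(len(layout_a)) for j in range(len(layout_b)))
    return math.exp(-total)

# Two layouts with different element counts and no shared categories.
doc = [("text", 0.1, 0.1, 0.8, 0.2), ("image", 0.1, 0.4, 0.8, 0.5)]
ui = [("button", 0.1, 0.1, 0.4, 0.2), ("button", 0.5, 0.1, 0.4, 0.2),
      ("icon", 0.1, 0.4, 0.8, 0.5)]
print(layout_similarity(doc, doc))  # close to 1 for identical layouts
print(layout_similarity(doc, ui))   # lower, but still well defined
```

Because mass can be split across several elements, the two half-width buttons in ui can jointly absorb the mass of the single wide text block in doc, which a strict one-to-one, same-category matching cannot express.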
