FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

Haoyu Sun, Huichen Will Wang, Jiawei Gu, Linjie Li, Yu Cheng

TL;DR

FullFront addresses the need to evaluate Multimodal Large Language Models across the entire front-end engineering workflow. It introduces a two-stage HTML synthesis pipeline to transform real-world webpages into clean, standardized HTML and defines three core tasks: Webpage Design, Webpage Perception QA, and Webpage Code Generation, all under robust visual- and code-level evaluation. The study reveals substantial gaps between current MLLMs and human performance, particularly in perception fidelity, image handling, layout accuracy, and interactive code generation, with proprietary models generally outperforming open-source ones. By providing a unified benchmark, a Mini dataset for rapid testing, and public code, FullFront offers a practical framework to drive development toward end-to-end intelligent webpage development tools.

Abstract

Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available at https://github.com/Mikivishy/FullFront.

Paper Structure

This paper contains 45 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of the eight subtasks FullFront covers and our data construction pipeline.
  • Figure 2: Comparison of the images used in FullFront for webpage code generation tasks with those of other benchmarks. FullFront is the first to use neither a single image placeholder nor random images.
  • Figure 3: MLLM Errors in Webpage Perception QA. (a) Distribution of error types for 200 questions. (b) An illustrative example of a Positioning Error. (c) An illustrative example of a Size Error.
  • Figure 4: Three common errors in Webpage Code Generation. (a) Abnormal Image Sizes, where an image within the rendered page is disproportionately large. (b) Blank Pages, showing an entirely blank rendered output. (c) Isolation Error, demonstrating an output consisting only of an isolated interactive element.
  • Figure 5: Human evaluation comparing MLLM-generated and real-world webpages.
  • ...and 4 more figures