MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models

Hyunjun Kim, Sejong Kim

TL;DR

MacroBench investigates whether LLMs can synthesize executable web-automation macros from natural-language goals using a code-first paradigm with Python+Selenium. It employs a controlled, self-hosted ecosystem of seven websites and a taxonomy of 681 tasks to enable end-to-end evaluation, including static checks, sandboxed execution, DOM/database verification, and a safety suite for dual-use risks. Results show strong performance on simple tasks across GPT-4o-mini, GPT-4o, Gemini, and DeepSeek, but complex, planner-heavy workflows remain unsolved and no model achieves production-grade code quality, highlighting gaps in reasoning, synchronization, and robustness. The work provides a reproducible benchmark and safety-oriented artifacts to guide future improvements in macro synthesis and macro-aware safety for automation systems.

Abstract

We introduce MacroBench, a code-first benchmark that evaluates whether LLMs can synthesize reusable browser-automation programs (macros) from natural-language goals by reading HTML/DOM and emitting Selenium. MacroBench instantiates seven self-hosted sites covering 681 tasks across interaction complexity and targeting difficulty. Our end-to-end protocol validates generated code via static checks, sandboxed execution, and outcome verification (DOM assertions, database snapshots), and includes a safety suite for scraping, spam/abuse, and credential/privacy prompts. Across 2,636 model-task runs, we observe stratified success: GPT-4o-mini (96.8%), GPT-4o (95.3%), Gemini (89.0%), DeepSeek (83.4%). Models handle simple tasks reliably (91.7%) but fail on complex workflows (0.0%), and none meet production-quality coding practices despite functional completion. We release our complete benchmark pipeline, evaluation framework, and experimental results at https://github.com/hyunjun1121/MacroBench to enable reproducible assessment of macro synthesis for web automation.
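
To make the first stage of the protocol concrete, the following is a minimal sketch of what a static check on a generated macro might look like. It is not taken from the MacroBench repository: the function name `static_check`, the `FORBIDDEN_MODULES` deny list, and the specific rules are assumptions chosen for illustration. It only parses the generated source with Python's standard `ast` module, so it does not execute the macro or require Selenium to be installed.

```python
import ast

# Hypothetical deny list; the benchmark's actual static rules are not specified here.
FORBIDDEN_MODULES = {"subprocess", "socket"}


def static_check(source):
    """Return a list of problems found in a generated macro.

    An empty list means the source passed this (illustrative) static stage.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Code that does not parse can never reach sandboxed execution.
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    # Collect the top-level package name of every import in the macro.
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])

    problems = []
    if "selenium" not in imported:
        problems.append("macro never imports selenium")
    for name in sorted(imported & FORBIDDEN_MODULES):
        problems.append(f"forbidden import: {name}")
    return problems


if __name__ == "__main__":
    good = "from selenium import webdriver\ndriver = webdriver.Chrome()\n"
    bad = "import subprocess\nprint('hello')\n"
    print(static_check(good))
    print(static_check(bad))
```

A macro that passes such a check would then move on to sandboxed execution and outcome verification, which this sketch deliberately leaves out.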

Paper Structure

This paper contains 69 sections, 3 tables.