Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Junliang Liu, Jingyu Xiao, Wenxin Tang, Zhixian Wang, Zipeng Xie, Wenxuan Wang, Minrui Zhang, Shuanghe Yu

TL;DR

WebRRSBench is a comprehensive web understanding benchmark that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, including position relationship reasoning, color robustness, and safety-critical detection.

Abstract

Multimodal large language models (MLLMs) are increasingly deployed as the core reasoning engine for web-facing systems, powering GUI agents and front-end automation that must interpret page structure, select actionable widgets, and execute multi-step interactions reliably. However, existing benchmarks largely emphasize visual perception or UI code generation and do not sufficiently evaluate the reasoning, robustness, and safety capabilities required for end-to-end web applications. To bridge this gap, we introduce WebRRSBench, a comprehensive web understanding benchmark that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, including position relationship reasoning, color robustness, and safety-critical detection. The benchmark is constructed from 729 websites and contains 3799 QA pairs that probe multi-step inference over page structure, text, widgets, and safety-critical interactions. To ensure reliable measurement, we adopt standardized prompts, a protocolized and deterministic evaluation pipeline, and multi-stage quality control that combines automatic checks with targeted human verification. We evaluate 11 MLLMs on WebRRSBench. The results reveal significant gaps: models still struggle with compositional and cross-element reasoning over realistic layouts, show limited robustness to perturbations in user interfaces and content, such as layout rearrangements or visual style shifts, and are rather conservative in recognizing and avoiding safety-critical or irreversible actions. Our code and appendix are available at https://github.com/annoy-worker/WebRSSBench.

Paper Structure

This paper contains 17 sections, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Evaluation tasks and dimensions in WebRRSBench.
  • Figure 3: Task and evaluation pipeline in WebRRSBench.
  • Figure 4: Examples of color perturbation mechanisms for evaluating model robustness. (a) Global low-contrast: uniform contrast reduction across the entire screenshot. (b) Partial button chroma: color modification applied to a randomly selected button. (c) All button chroma: color shifts applied to all button elements on the page.
  • Figure 5: Output examples of MLLMs on WebRRSBench when facing different perturbations.
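
The three color perturbations named in the Figure 4 caption can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function names, the flat pixel-list image representation, the contrast factor, and the color offset are all assumptions chosen for illustration, and the button bounding boxes are assumed to be known in advance.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def _clamp(value: float) -> int:
    """Round to the nearest integer and keep it in the 0..255 range."""
    return max(0, min(255, round(value)))


def reduce_contrast(pixels: List[Pixel], factor: float = 0.5) -> List[Pixel]:
    """Global low-contrast (hypothetical): pull every channel toward the mean.

    factor=1.0 leaves the image unchanged; factor=0.0 flattens it to the mean.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    if not pixels:
        return []
    mean = sum(sum(p) for p in pixels) / (3 * len(pixels))
    return [
        (
            _clamp(mean + factor * (r - mean)),
            _clamp(mean + factor * (g - mean)),
            _clamp(mean + factor * (b - mean)),
        )
        for r, g, b in pixels
    ]


def shift_button_color(
    pixels: List[Pixel],
    width: int,
    boxes: Sequence[Box],
    delta: Tuple[int, int, int],
) -> List[Pixel]:
    """Button chroma (hypothetical): add a per-channel offset inside each box.

    Passing one box mimics the "partial button" case; passing every button's
    box mimics the "all button" case.
    """
    out = list(pixels)
    for left, top, right, bottom in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                i = y * width + x
                r, g, b = out[i]
                out[i] = (
                    _clamp(r + delta[0]),
                    _clamp(g + delta[1]),
                    _clamp(b + delta[2]),
                )
    return out


# A 2x2 toy "screenshot" stored row by row.
image = [(0, 0, 0), (255, 255, 255), (100, 100, 100), (200, 200, 200)]
low_contrast = reduce_contrast(image, factor=0.5)
one_button = shift_button_color(image, width=2, boxes=[(0, 0, 1, 1)], delta=(50, 0, -50))
```

In this sketch the answer to a question about the page should stay the same before and after the perturbation, so a robustness check would compare a model's outputs on `image` against its outputs on `low_contrast` or `one_button`.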