See and Remember: A Multimodal Agent for Web Traversal

Xinjun Wang; Shengyao Wang; Aimin Zhou; Hao Hao

TL;DR

This paper proposes V-GEMS, a generally applicable and robust multimodal agent architecture for precise, resilient web traversal. It integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking.

Abstract

Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable and robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show that V-GEMS substantially outperforms the WebWalker baseline, achieving a 28.7% performance gain. Code is available at https://github.com/Vaultttttttttttt/V-GEMS.
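To make the idea of an explicit memory stack with state tracking concrete, the following is a minimal sketch, not the authors' implementation. The class name `TraversalMemory`, its methods, and the example URLs are hypothetical and chosen only for illustration. It assumes that loop prevention can be approximated by a set of visited URLs and that backtracking means popping the stack to return to the parent page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraversalMemory:
    """Hypothetical explicit memory: a URL stack plus a visited set."""

    stack: list = field(default_factory=list)   # current path from the root page
    visited: set = field(default_factory=set)   # every URL seen so far

    def visit(self, url: str) -> bool:
        """Push url if it is new; return False to flag a would-be loop."""
        if url in self.visited:
            return False
        self.visited.add(url)
        self.stack.append(url)
        return True

    def backtrack(self) -> Optional[str]:
        """Pop the current page and return its parent, or None at the root."""
        if len(self.stack) <= 1:
            return None
        self.stack.pop()
        return self.stack[-1]

    def path(self) -> list:
        """Return a copy of the current traversal path."""
        return list(self.stack)


memory = TraversalMemory()
memory.visit("https://example.org/")
memory.visit("https://example.org/docs")
revisit_allowed = memory.visit("https://example.org/")  # already seen
parent = memory.backtrack()
```

In this sketch, a rejected `visit` call signals that the agent is about to re-enter a page it has already explored, and `backtrack` gives it a valid parent page to return to instead.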

Paper Structure (47 sections, 5 equations, 8 figures, 5 tables)

Introduction
Related Work
Web Automation and Multimodality
Web Navigation Agents
Web Traversal Benchmarks
Proposed Method
Symbolic Counter with Adaptive Termination
Hierarchical State Management via URL Stack
Adaptive VLM Integration via US Calculator
Empirical Studies
Constructing a Reproducible Benchmark: EverWebQA
Hierarchical Web Traversal
Topology-Aware QA Synthesis
Teacher-Model Vetting
Data Statistics and Characteristics
...and 32 more sections

Figures (8)

Figure 1: The framework of V-GEMS. Our system augments the dual-agent architecture (Explorer and Critic) with a specialized Tools suite consisting of a Counter, URL Stack, and US Calculator. The US Calculator adaptively scores page content to decide between LLM and VLM processing. The URL Stack enables stateful backtracking, while the Counter ensures arithmetic precision across multiple navigation steps. For a comprehensive visualization of the end-to-end execution workflow, please refer to Appendix \ref{sec:demonstration}.
Figure 2: The framework of test dataset generation
Figure 3: Data distribution of the EverWebQA dataset across domains and languages.
Figure 4: The comparison in type and domain.
Figure 5: Contribution of the three methods across different tasks
...and 3 more figures
