Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Aditya Bharat Soni, Boxuan Li, Xingyao Wang, Valerie Chen, Graham Neubig

TL;DR

OpenHands-Versa addresses generalization by combining three core capabilities (coding, multimodal web browsing, and information access) in a single generalist agent built on the OpenHands framework. It integrates Set-of-Marks visual browsing, a browsing condenser, search API-based information retrieval, and lightweight planning to orchestrate tasks across diverse domains. Across GAIA, SWE-Bench Multimodal, and The Agent Company, it achieves state-of-the-art or near state-of-the-art performance, outperforming specialized multi-agent systems and delivering substantial gains in resolve rate and completion metrics. The results demonstrate that a generalist design with a modest toolkit can solve heterogeneous tasks, while also highlighting open challenges such as CAPTCHA barriers, hallucinations introduced by summaries, and other failure modes that warrant further refinement.

Abstract

Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. Similarly, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good at one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, multimodal web browsing, and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points, respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

Paper Structure

This paper contains 22 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of OpenHands-Versa with previously published SOTA agents and OpenHands across GAIA, SWE-Bench Multimodal (SWE-Bench M), and The Agent Company. OpenHands-Versa outperforms the SOTA specialist agents on all three benchmarks. Notably, OpenHands-Versa improves the browsing and information access abilities of OpenHands while maintaining its software engineering capabilities. We focus on comparing to prior agents with reproducible code and results (more details in the results table).
  • Figure 2: Distribution of the different tools used by OpenHands and OpenHands-Versa. OpenHands-Versa adapts its tool usage to each benchmark without any benchmark-specific optimizations, and its tool usage is more domain-aware than that of OpenHands. Tools with bold-faced names have been modified or created in this work.
  • Figure 3: Example screenshot of a webpage with Set-of-Marks annotation.