V-IRL: Grounding Virtual Intelligence in Real Life

Jihan Yang; Runyu Ding; Ellis Brown; Xiaojuan Qi; Saining Xie

TL;DR

V-IRL is an open-source platform that grounds virtual agents in real-world geospatial data and street-view imagery to close the sensory gap between digital AI and the physical world. It demonstrates earthbound, language-driven, visually grounded, and collaborative agents operating in urban environments around the globe, along with global benchmarks that evaluate vision-language models and end-to-end agent performance on open-world data. The work highlights scalable data collection, diverse exemplar agents, and ethical considerations around privacy and bias, with the aim of accelerating the development of perceptually grounded autonomous agents with real-world utility. The platform is intended to support practical applications, from urban planning to personalized assistance, and invites broad community participation and scrutiny of model behavior across diverse geographies.

Abstract

There is a sensory gulf between the Earth that humans inhabit and the digital realms in which modern AI agents are created. To develop AI agents that can sense, think, and act as flexibly as humans in real-world settings, it is imperative to bridge the realism gap between the digital and physical worlds. How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. Our platform serves as a playground for developing agents that can accomplish various practical tasks and as a vast testbed for measuring progress in capabilities spanning perception, decision-making, and interaction with real-world data across the entire globe.

Paper Structure (61 sections, 24 figures, 8 tables)

Introduction
Related Work
Virtual Intelligence in Real Life
Earthbound Agents
Language-Driven Agents
Visually Grounded Agents
Collaborative Agents
Agent-Agent Collaboration
Human-Agent Collaboration
System Fundamentals
Agent Definition
Platform Components
Environment (Action)
Vision (Perception)
Language (Reasoning & Collaboration)
...and 46 more sections

Figures (24)

Figure 1: V-IRL agents leverage real-world geospatial information and street view imagery to navigate urban terrains, execute complex tasks, and interact in real-time scenarios. From recommending relevant destinations to assessing city infrastructure to collaboratively giving and following verbal directions, we develop agents that illustrate V-IRL's current capabilities, flexibility, and utility. Above all else, we present a flexible platform for researchers to harness abundant data from across the globe to create and test diverse autonomous agents.
Figure 2: Finding the shortest path for Peng to travel to five places.
Figure 3: Imani's visualization of trash bins, fire hydrants, and park benches in NYC's Central Park using data collected by RX-399.
Figure 4: Portions of RX-399's system records in HK and NYC.
Figure 5: RX-399 avoids double-counting trash cans by identifying duplicates across different viewpoints using feature matching.
...and 19 more figures
