
AURA: Multimodal Shared Autonomy for Real-World Urban Navigation

Yukai Ma, Honglin He, Selina Song, Wayne Wu, Bolei Zhou

Abstract

Long-horizon navigation in complex urban environments relies heavily on continuous human operation, which leads to fatigue, reduced efficiency, and safety concerns. Shared autonomy, where a Vision-Language AI agent and a human operator collaborate to maneuver the mobile machine, presents a promising solution to these issues. However, existing shared autonomy methods often require the human and the AI to operate within the same action space, leading to high cognitive overhead. We present Assistive Urban Robot Autonomy (AURA), a new multimodal framework that decomposes urban navigation into high-level human instruction and low-level AI control. AURA incorporates a Spatial-Aware Instruction Encoder (SIE) to align diverse human instructions with visual and spatial context. To facilitate training, we construct MM-CoS, a large-scale dataset comprising teleoperation data and vision-language descriptions. Experiments in simulation and the real world demonstrate that AURA effectively follows human instructions, reduces manual operation effort, and improves navigation stability, while enabling online adaptation. Moreover, under similar takeover conditions, our shared autonomy framework reduces the frequency of takeovers by more than 44%. A demo video and further details are provided on the project page.


Paper Structure

This paper contains 34 sections, 2 equations, 18 figures, and 4 tables.

Figures (18)

  • Figure 1: Shared Autonomy for Urban Navigation. We introduce AURA, a dual-system VLA for shared autonomy in urban navigation. AURA not only follows instructions but also enables human users to guide and correct a robot in real time through various visual and language instructions.
  • Figure 2: Overview of the AURA shared autonomy framework. (a) AURA takes front-camera RGB observations and optional human guidance (e.g., texting, drafting, or arrowing). Observations are encoded by a ViT, while human inputs are processed by the SIE and tokenized; all tokens are fused in a pretrained LLM (with LoRA adapters) to produce context features. A diffusion-based action decoder then predicts a distribution over future trajectories via anchor proposals. (b) The SIE converts drafting/arrowing inputs into instruction tokens: it renders the human input as visual prompts, encodes the control points/vectors, and fuses them with visual features to produce instruction embeddings that are injected into the LLM via $\langle\texttt{instruction}\rangle$. (A minimal code sketch of this dataflow follows the figure list.)
  • Figure 3: Samples from the auto-labeling pipeline. Each frame is annotated with three training labels produced by our auto-labeling pipeline: (1) the texting command, expressed as a short verb phrase (e.g., "go straight", "slow down", "speed up"), (2) the drafting input, visualized as a path rendered from the ground-truth future trajectory, and (3) the arrowing input, represented by the instantaneous speed. The rightmost panel shows the reasoning traces used to supervise drafting and arrowing prediction. (A toy labeling sketch follows the figure list.)
  • Figure 4: Visualization of offline inference in MM-CoS. We illustrate three types of human instructions. The green polygon denotes the future trajectories predicted by AURA.
  • Figure 5: Human Cost Evaluation in Pseudo-simulation. We compare the human intervention cost of our model against prior methods (mirowski2016learning; liu2024citywalker), as well as across different modes of instruction guidance within our own framework.
  • ...and 13 more figures
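
To make the Figure 2 dataflow concrete, below is a minimal PyTorch-style sketch of how RGB observations, SIE instruction tokens, an LLM backbone, and a diffusion-style action decoder could fit together. Every module here (SpatialInstructionEncoder, AnchorDiffusionDecoder, the toy patch encoder, and all dimensions) is an illustrative stand-in assumed for exposition, not the authors' released implementation.

```python
# Minimal, illustrative sketch of the AURA dataflow in Figure 2.
# All module names, dimensions, and the stand-in networks below are assumptions
# for exposition; they are NOT the authors' released implementation.
import torch
import torch.nn as nn


class SpatialInstructionEncoder(nn.Module):
    """Hypothetical SIE stand-in: embeds drafting/arrowing control points
    and fuses them with visual features into instruction tokens."""
    def __init__(self, dim=256):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, control_points, visual_tokens):
        # control_points: (B, P, 2) sketched waypoints; visual_tokens: (B, N, dim)
        q = self.point_mlp(control_points)
        instr, _ = self.fuse(q, visual_tokens, visual_tokens)
        return instr  # (B, P, dim) instruction tokens injected at <instruction>


class AnchorDiffusionDecoder(nn.Module):
    """Toy denoiser stand-in: refines trajectory anchors conditioned on
    fused context features (a single refinement step, for brevity)."""
    def __init__(self, dim=256, horizon=8):
        super().__init__()
        self.denoise = nn.Sequential(
            nn.Linear(dim + horizon * 2, dim), nn.ReLU(), nn.Linear(dim, horizon * 2)
        )

    def forward(self, context, anchors):
        # context: (B, dim) pooled features; anchors: (B, A, horizon, 2) proposals
        B, A, H, _ = anchors.shape
        ctx = context.unsqueeze(1).expand(B, A, -1)
        flat = anchors.reshape(B, A, H * 2)
        delta = self.denoise(torch.cat([ctx, flat], dim=-1))
        return (flat + delta).reshape(B, A, H, 2)  # refined trajectory proposals


class AURASketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.vit = nn.Sequential(nn.Flatten(2), nn.Linear(32 * 32, dim))  # toy patch encoder
        self.sie = SpatialInstructionEncoder(dim)
        self.llm = nn.TransformerEncoder(  # stand-in for the LoRA-adapted pretrained LLM
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2
        )
        self.decoder = AnchorDiffusionDecoder(dim)

    def forward(self, rgb_patches, control_points, anchors):
        vis = self.vit(rgb_patches)                       # (B, N, dim) visual tokens
        instr = self.sie(control_points, vis)             # (B, P, dim) instruction tokens
        fused = self.llm(torch.cat([vis, instr], dim=1))  # joint context features
        context = fused.mean(dim=1)                       # pooled conditioning vector
        return self.decoder(context, anchors)             # future trajectory proposals


if __name__ == "__main__":
    model = AURASketch()
    rgb = torch.randn(1, 16, 32, 32)       # 16 fake image patches of 32x32
    points = torch.randn(1, 5, 2)          # 5 sketched control points (drafting input)
    anchors = torch.randn(1, 6, 8, 2)      # 6 trajectory anchors, 8 waypoints each
    print(model(rgb, points, anchors).shape)  # torch.Size([1, 6, 8, 2])
```

Running the script prints the shape of the refined trajectory proposals, mirroring the anchor-based prediction described in the Figure 2 caption.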
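
Similarly, the three training labels in Figure 3 can be sketched as a small labeling routine over a recorded frame. The geometric and speed heuristics below (the angle and speed-ratio thresholds, and the robot-frame convention) are assumptions chosen for illustration; the paper's auto-labeling pipeline may derive the texting command differently.

```python
# Illustrative auto-labeling sketch for the three supervision signals in Figure 3.
# The heuristics and thresholds below are assumptions for exposition only.
import numpy as np


def auto_label(future_xy: np.ndarray, speeds: np.ndarray) -> dict:
    """Derive (texting, drafting, arrowing) labels for one recorded frame.

    future_xy: (H, 2) ground-truth future waypoints in the robot frame (x forward, y left).
    speeds:    (H,) instantaneous speeds along the same horizon (m/s).
    """
    # (2) Drafting label: the rendered path is the future trajectory itself,
    #     kept as control points that an instruction encoder can later consume.
    drafting = future_xy.copy()

    # (3) Arrowing label: an instantaneous speed (here, the current value).
    arrowing = float(speeds[0])

    # (1) Texting label: a short verb phrase from simple geometric/speed heuristics.
    heading_change = np.degrees(np.arctan2(future_xy[-1, 1], future_xy[-1, 0]))
    if heading_change > 20:
        texting = "turn left"
    elif heading_change < -20:
        texting = "turn right"
    elif speeds[-1] > speeds[0] * 1.2:
        texting = "speed up"
    elif speeds[-1] < speeds[0] * 0.8:
        texting = "slow down"
    else:
        texting = "go straight"

    return {"texting": texting, "drafting": drafting, "arrowing": arrowing}


if __name__ == "__main__":
    traj = np.stack([np.linspace(0, 8, 8), np.linspace(0, 3, 8)], axis=1)  # gentle left curve
    spd = np.linspace(1.0, 1.5, 8)
    print(auto_label(traj, spd)["texting"])  # "turn left"
```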