
VA-FastNavi-MARL: Real-Time Robot Control with Multimedia-Driven Meta-Reinforcement Learning

Yang Zhang, Shengxi Jing, Fengxiang Wang, Yuan Feng, Hong Wang

Abstract

Interpreting dynamic, heterogeneous multimedia commands with real-time responsiveness is critical for Human-Robot Interaction. We present VA-FastNavi-MARL, a framework that aligns asynchronous audio-visual inputs into a unified latent representation. By treating diverse instructions as a distribution of navigable goals via Meta-Reinforcement Learning, our method enables rapid adaptation to unseen directives with negligible inference overhead. Unlike approaches bottlenecked by heavy sensory processing, our modality-agnostic stream ensures seamless, low-latency control. Validation on a multi-arm workspace confirms that VA-FastNavi-MARL significantly outperforms baselines in sample efficiency and maintains robust, real-time execution even under noisy multimedia streams.

Paper Structure

This paper contains 22 sections, 7 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Architecture of VA-FastNavi-MARL. Multimedia Instruction Generator (top): heterogeneous inputs (machine, audio, visual) are fused into a unified embedding ($\varphi_n$) via parallel encoders. Dynamic Instruction Stream (middle): an asynchronous buffer schedules real-time commands ($\varphi_t$) to handle irregular input rates. Task-Adaptive Motion Controller (bottom): the active instruction parameterizes the task ($T_t$), triggering the Meta-RL policy $\pi_{\theta}$ to adapt into a specialized task policy $\pi_{\theta_t}$ for control.
  • Figure 2: Architecture of the policy network. The network processes a 15-dimensional state vector through two hidden layers and uses a dual-head structure to output the parameters ($\mu$ and $\log \sigma$) of a 9-dimensional Gaussian policy. Output values are clamped for numerical stability before final action sampling and scaling.
  • Figure 3: Performance comparison across different models. From left to right: Average Reward, Success Rate, and Collision Rate. Our method (Solid Red) achieves faster adaptation and higher safety compared to baselines.
  • Figure 4: Trajectory evolution (top) during the Long-Horizon Continuous Adaptation test ($T=2000$ steps). The experiment is segmented into five distinct phases (separated by vertical dashed lines), in which the underlying instruction logic shifts abruptly from simple instructions (e.g., Seq: [1,2,3]) to complex ones (e.g., Seq: [2,3,1,3,2,3]). The bottom panel shows the state-activation timeline for the continuous adaptation task: blue segments denote the Approaching phase (active execution), while pink regions mark the Returning/Holding phase.
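To make the dual-head structure described in Figure 2 concrete, the sketch below implements such a Gaussian policy head in plain NumPy. The parts matching the caption are the 15-dimensional state input, two hidden layers, separate $\mu$ and $\log \sigma$ heads over a 9-dimensional action, and clamping of the log-std for numerical stability; the hidden width (64 units), the tanh activation, and the clamp range $[-5, 2]$ are illustrative assumptions not specified in the figure.

```python
import numpy as np


def init_layer(rng, fan_in, fan_out):
    """Small random weights, zero biases (initialization scheme assumed)."""
    return rng.standard_normal((fan_in, fan_out)) * 0.1, np.zeros(fan_out)


class GaussianPolicySketch:
    """Dual-head Gaussian policy in the spirit of Figure 2.

    Assumed details: hidden width 64, tanh activations, log-std clamp [-5, 2].
    """

    def __init__(self, state_dim=15, hidden=64, action_dim=9, seed=0):
        rng = np.random.default_rng(seed)
        self.W1, self.b1 = init_layer(rng, state_dim, hidden)
        self.W2, self.b2 = init_layer(rng, hidden, hidden)
        self.W_mu, self.b_mu = init_layer(rng, hidden, action_dim)   # mean head
        self.W_ls, self.b_ls = init_layer(rng, hidden, action_dim)   # log-std head
        self.rng = rng

    def forward(self, state):
        # Two hidden layers over the 15-D state vector.
        h = np.tanh(state @ self.W1 + self.b1)
        h = np.tanh(h @ self.W2 + self.b2)
        mu = h @ self.W_mu + self.b_mu
        # Clamp log-std before sampling, as the caption describes.
        log_sigma = np.clip(h @ self.W_ls + self.b_ls, -5.0, 2.0)
        return mu, log_sigma

    def sample(self, state, action_scale=1.0):
        # Reparameterized sample from N(mu, sigma^2), then final scaling.
        mu, log_sigma = self.forward(state)
        action = mu + np.exp(log_sigma) * self.rng.standard_normal(mu.shape)
        return action_scale * action
```

A forward pass on a zero state returns two 9-dimensional vectors (the Gaussian parameters), and `sample` draws a scaled action from the resulting distribution.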