SOMA: Strategic Orchestration and Memory-Augmented System for Vision-Language-Action Model Robustness via In-Context Adaptation

Zhuoran Li, Zhiyang Li, Kaijun Zhou, Jinyu Gu

Abstract

Despite the promise of Vision-Language-Action (VLA) models as generalist robotic controllers, their robustness against perceptual noise and environmental variations in out-of-distribution (OOD) tasks remains fundamentally limited by the absence of long-term memory, causal failure attribution, and dynamic intervention capability. To address this, we propose SOMA, a Strategic Orchestration and Memory-Augmented System that upgrades frozen VLA policies for robust in-context adaptation without parameter fine-tuning. Specifically, SOMA operates through an online pipeline of contrastive Dual-Memory Retrieval-Augmented Generation (RAG), an Attribution-Driven Large-Language-Model (LLM) Orchestrator, and extensible Model Context Protocol (MCP) interventions, while an offline Memory Consolidation module continuously distills the execution traces into reliable priors. Experimental evaluations across three backbone models (pi0, pi0.5, and SmolVLA) on LIBERO-PRO and our proposed LIBERO-SOMA benchmarks demonstrate that SOMA achieves an average absolute success rate gain of 56.6%. This includes a significant absolute improvement of 89.1% in long-horizon task chaining. Project page and source code are available at: https://github.com/LZY-1021/SOMA.

Paper Structure (33 sections, 16 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Mastering Long-Horizon Manipulation via Memory-Driven Decomposition and Chaining. SOMA transforms abstract, multi-step instructions into precise execution sequences by orchestrating intent parsing, memory-based retrieval, and adaptive control-flow regulation.
  • Figure 2: Comparing attention maps under varied inputs. Red and blue indicate high and low attention. Baseline maps (red frames) exhibit diffuse, non-specific focus. Targeted MCP interventions (green frames) facilitate precise object identification and refined edge delineation through: (a) Overlaying visual cues; (b) Simplifying linguistic prompts; (c) Filtering background clutter; and (d) Decomposing long-horizon subtasks.
  • Figure 3: SOMA Framework. Given the current observation and instruction, the Dual-Memory RAG module retrieves relevant experiences from a dual-memory bank. Based on this context, the LLM Orchestrator matches the MCP toolset, estimates execution parameters $\theta$, and synthesizes an intervention chain, which is executed by the extensible MCP tools. Post-execution, memory consolidation runs asynchronously to update the Dual-Memory Bank, closing the loop for iterative refinement.
  • Figure 4: Visual Focus. Addressing visual shift in complex manipulation environments.
  • Figure 5: Clutter Removal. Erasing irrelevant distractors to mitigate causal confusion.
  • ...and 5 more figures
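
The loop described in the Figure 3 caption (retrieve from a dual memory, choose an intervention chain, run the tools, then consolidate the outcome) can be illustrated with a small sketch. This is not the authors' implementation: every name here (`DualMemoryBank`, `orchestrate`, `run_episode`, the two toy tools, and the toy policy) is a hypothetical stand-in, the LLM Orchestrator is replaced by a simple rule, word overlap replaces real retrieval, and the execution parameters $\theta$ are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Experience:
    instruction: str
    tools: List[str]   # intervention chain that was applied
    success: bool


@dataclass
class DualMemoryBank:
    """Hypothetical dual memory: successes and failures are stored separately."""
    successes: List[Experience] = field(default_factory=list)
    failures: List[Experience] = field(default_factory=list)

    def retrieve(self, instruction: str, k: int = 2):
        # Assumption: word overlap stands in for a learned similarity measure.
        words = set(instruction.split())

        def score(e: Experience) -> int:
            return len(words & set(e.instruction.split()))

        def top(pool: List[Experience]) -> List[Experience]:
            ranked = sorted(pool, key=score, reverse=True)
            return [e for e in ranked if score(e) > 0][:k]

        return top(self.successes), top(self.failures)

    def consolidate(self, exp: Experience) -> None:
        # Run inline here for simplicity; the caption describes it as asynchronous.
        (self.successes if exp.success else self.failures).append(exp)


Tool = Callable[[List[str], str], Tuple[List[str], str]]


def simplify_prompt(obs: List[str], instr: str):
    return obs, instr.lower().replace("please ", "")


def remove_clutter(obs: List[str], instr: str):
    # Keep only the objects that the instruction mentions.
    return [o for o in obs if o in instr], instr


def orchestrate(good, bad, toolset: Dict[str, Tool]) -> List[str]:
    """Rule-based stand-in for the LLM Orchestrator: reuse a chain that
    worked before, skipping any chain that also appears among failures."""
    failed_chains = {tuple(e.tools) for e in bad}
    for e in good:
        chain = [t for t in e.tools if t in toolset]
        if chain and tuple(chain) not in failed_chains:
            return chain
    return []  # no intervention: the frozen policy sees the raw inputs


def run_episode(obs, instr, bank, toolset, policy):
    original = instr
    good, bad = bank.retrieve(instr)
    chain = orchestrate(good, bad, toolset)
    for name in chain:                     # execute the intervention chain
        obs, instr = toolset[name](obs, instr)
    success = policy(obs, instr)           # frozen policy, no fine-tuning
    bank.consolidate(Experience(original, chain, success))
    return chain, success


def toy_policy(obs: List[str], instr: str) -> bool:
    # Toy assumption: the policy only succeeds on an uncluttered scene.
    return len(obs) == 1


TOOLSET = {"simplify_prompt": simplify_prompt, "remove_clutter": remove_clutter}
```

With one prior success that used `remove_clutter` stored in the bank, `run_episode(["mug", "plate"], "pick up the mug", bank, TOOLSET, toy_policy)` selects that tool again and the toy policy succeeds; with an empty bank no tool is selected, the toy policy fails, and the failure is recorded for later retrieval.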