LAGEA: Language Guided Embodied Agents for Robotic Manipulation

Abdul Monaf Chowdhury, Akm Moshiur Rahman Mazumder, Rabeya Akter, Safaeid Hossain Arib

TL;DR

LAGEA addresses the sparse-reward challenge in robotic manipulation by using structured, episodic feedback from a vision–language model as temporally localized guidance. It combines keyframe-based credit assignment, calibrated feedback–vision alignment, and delta-based reward shaping with an adaptive failure-aware coefficient $\rho_t$ to produce dense, stable signals early in learning that recede as competence grows. The approach yields faster convergence and higher success rates on the Meta-World MT10 and Robotic Fetch benchmarks, outperforming strong baselines by up to 17% in average success and remaining robust across ablations. The results indicate that structured, language-driven diagnosis of failures, when grounded in visual representations, can guide embodied agents toward better decisions, with practical relevance for scalable manipulation and future sim-to-real transfer. The method integrates VLM-based failure reasoning with reinforcement learning, offering a scalable path to using natural language as a learning signal in embodied AI.

Abstract

Robotic manipulation benefits from foundation models that describe goals, but today's agents still lack a principled way to learn from their own mistakes. We ask whether natural language can serve as feedback, an error-reasoning signal that helps embodied agents diagnose what went wrong and correct course. We introduce LAGEA (Language Guided Embodied Agents), a framework that turns episodic, schema-constrained reflections from a vision-language model (VLM) into temporally grounded guidance for reinforcement learning. LAGEA summarizes each attempt in concise language, localizes the decisive moments in the trajectory, aligns feedback with visual state in a shared representation, and converts goal progress and feedback agreement into bounded, step-wise shaping rewards whose influence is modulated by an adaptive, failure-aware coefficient. This design yields dense signals early, when exploration needs direction, and gracefully recedes as competence grows. On the Meta-World MT10 and Robotic Fetch embodied manipulation benchmarks, LAGEA improves average success over state-of-the-art (SOTA) methods by 9.0% on random goals, 5.3% on fixed goals, and 17% on Fetch tasks, while converging faster. These results support our hypothesis: language, when structured and grounded in time, is an effective mechanism for teaching robots to self-reflect on mistakes and make better choices.
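To make the idea of a bounded shaping reward with an adaptive, failure-aware coefficient more concrete, here is a minimal Python sketch. It is not the authors' formulation: the abstract does not give the schedule for the coefficient, so the rule below (scale by the recent failure rate over a sliding window) and the `tanh` bound are assumptions chosen only for illustration, and the names `FailureAwareCoefficient` and `shaped_reward` are hypothetical.

```python
import math
from collections import deque


class FailureAwareCoefficient:
    """Hypothetical schedule: rho equals rho_max times the recent failure rate.

    Assumption for illustration only; the paper's actual rule is not
    stated in the abstract.
    """

    def __init__(self, rho_max=1.0, window=20):
        self.rho_max = rho_max
        self.outcomes = deque(maxlen=window)  # 1.0 = success, 0.0 = failure

    def update(self, success):
        """Record the outcome of one finished episode."""
        self.outcomes.append(1.0 if success else 0.0)

    @property
    def rho(self):
        """Large while the agent mostly fails, small once it mostly succeeds."""
        if not self.outcomes:
            return self.rho_max
        success_rate = sum(self.outcomes) / len(self.outcomes)
        return self.rho_max * (1.0 - success_rate)


def shaped_reward(task_reward, shaping, rho, bound=1.0):
    """Add a shaping term squashed into [-bound, bound] and scaled by rho."""
    bounded = bound * math.tanh(shaping / bound)
    return task_reward + rho * bounded
```

Under these assumptions the shaping term dominates while episodes fail and vanishes once the sliding window contains only successes, leaving the sparse task reward as the sole signal.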

Paper Structure

This paper contains 37 sections, 14 equations, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of LaGEA framework. (a) After each rollout, keyframe selection identifies causal moments and computes per-step weights $\hat{w}_t$; a VLM queried on those frames returns a schema-constrained self-reflection that is encoded as a feedback embedding $f$. Trajectories, $f$, and $\hat{w}_t$ are stored in buffer $\mathcal{D}$. (b) Trainable projectors $(E_i,E_t,E_f)$ map state images $x_t$, goal $g$, instruction $y$, and $f$ into a shared space; a hybrid calibration+contrastive objective $(\mathcal{L}_{\mathrm{align}},\mathcal{L}_{\mathrm{sym}})$ enforces control relevance. (c) The reward stage computes goal-delta $\Delta R_{goal}$ and feedback-delta $\Delta R_{fb}$, fuses them with the sparse task reward $R_{task}$, and produces the final dense reward for policy updates.
  • Figure 2: The computation of our delta-based rewards. (a) A Goal Potential $\phi_t$ is formed by aligning the current state $z_t$ with the goal image $z_g$ and instruction $z_y$. (b) A Feedback Potential $\psi_t$ is formed by aligning $z_t$ with the VLM feedback $z_f$. The temporal difference of these potentials creates the fused feedback-VLM rewards.
  • Figure 3: Natural-language feedback accelerates convergence: across eight Meta-World tasks, LaGEA reaches high success in far fewer steps than FuRL and SAC, which plateau late or stall.
  • Figure 4: Ablation studies on keyframe selection and reward shaping.
  • Figure 5: Alignment enables control-relevant geometry: (a) success/failure logit margin increases over training, (b) policy success accelerates, and (c) BCE/InfoNCE objectives co-train the shared space for LaGEA.
  • ...and 9 more figures
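
The delta-based rewards described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation: the captions do not say how the alignments are scored or fused, so cosine similarity, the equal-weight average inside the goal potential, the use of the keyframe weights $\hat{w}_t$ to scale only the feedback delta, and the additive fusion scaled by `rho` are all assumptions. Function names are hypothetical, and embeddings are plain Python lists to keep the example dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def goal_potential(z_t, z_g, z_y):
    """phi_t: alignment of state z_t with goal image z_g and instruction z_y.

    Assumption: an equal-weight average of the two cosine similarities.
    """
    return 0.5 * (cosine(z_t, z_g) + cosine(z_t, z_y))


def feedback_potential(z_t, z_f):
    """psi_t: alignment of state z_t with the VLM feedback embedding z_f."""
    return cosine(z_t, z_f)


def dense_rewards(states, z_g, z_y, z_f, task_rewards, weights, rho=1.0):
    """Fuse the sparse task reward with temporal differences of both potentials.

    Entry i of the result is the reward for the transition into step i + 1.
    Assumption: weights[t] (keyframe weight) scales only the feedback delta.
    """
    phi = [goal_potential(z, z_g, z_y) for z in states]
    psi = [feedback_potential(z, z_f) for z in states]
    rewards = []
    for t in range(1, len(states)):
        delta_goal = phi[t] - phi[t - 1]
        delta_fb = weights[t] * (psi[t] - psi[t - 1])
        rewards.append(task_rewards[t] + rho * (delta_goal + delta_fb))
    return rewards
```

Because each shaping term is a difference of potentials, a step that moves the state embedding toward the goal or toward the feedback earns a positive reward, while a step that moves away is penalized by the same amount.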