ReflexGrad: Three-Way Synergistic Architecture for Zero-Shot Generalization in LLM Agents

Ankush Kadu, Ashwanth Krishnan

TL;DR

ReflexGrad presents a triple-synergy framework that tightly couples hierarchical TODO decomposition, history-aware causal reflexion, and gradient-based prompt optimization to enable true zero-shot generalization in LLM agents. By wiring these components into a bidirectional feedback loop, the system derives actionable failure patterns, propagates corrective gradients, and improves task decomposition and memory consolidation across trials. Empirical results on ALFWorld show a zero-shot Trial 0 success rate of 67% with zero action loops and complete alignment among components, with success rising from 67% to 78% across trials, approaching few-shot baselines despite the harder zero-shot setting. The work highlights the importance of semantic cross-task transfer, memory structuring, and convergence dynamics for robust, scalable generalization in interactive agents.

Abstract

Enabling agents to learn from experience and generalize across diverse tasks without task-specific training remains a fundamental challenge in reinforcement learning and decision-making. While recent approaches have explored episodic memory (Reflexion), gradient-based prompt optimization (TextGrad), and hierarchical task decomposition independently, their potential for synergistic integration remains unexplored. We introduce ReflexGrad, a novel architecture that tightly couples three complementary mechanisms: (1) LLM-based hierarchical TODO decomposition for strategic planning, (2) history-aware causal reflection that analyzes recent action patterns to identify failure root causes and enable within-trial learning, and (3) gradient-based optimization for systematic improvement. Unlike prior work relying on few-shot demonstrations, our system achieves true zero-shot generalization through pure LLM semantic reasoning, requiring no task-specific examples, fine-tuning, or hardcoded similarity metrics. Evaluated on ALFWorld benchmark tasks, ReflexGrad achieves a 67% zero-shot success rate on Trial 0 without any prior task experience or demonstrations, establishing effective performance on first exposure. Through empirical analysis, we identify the architectural mechanisms underlying stable convergence (zero action loops) and effective cross-task transfer (success improving from 67% to 78%). Our work demonstrates that synergistic integration of complementary learning mechanisms enables robust zero-shot generalization that approaches few-shot baselines from prior work.

Paper Structure

This paper contains 38 sections, 10 equations, 1 figure, 4 tables, 3 algorithms.

Figures (1)

  • Figure 1: ReflexGrad Dual-Loop Self-Evolution Mechanism. The system operates through continuous execution with two parallel feedback loops around a central episodic memory core. Forward Pass: (1) Action selection receives optimized policy + reflexion insights, (2) Execute action in environment, (3) Compute TextGrad loss. Backward Pass: (4) TextGrad backward propagation with reflexion context, (5) Append gradient to policy with TODO tracking. Loop 1 (Left): Optimizer synthesizes accumulated gradients every 3 steps to update policy via LLM-based merge. Loop 2 (Bottom): Reflexion check triggers every 5 steps or on failure, generating causal insights stored in working reflexion buffer and episodic memory. Both loops feed back to action selection. The central circular representation shows the episodic memory system that provides context to all components. The key innovation is three-way synergistic coupling: TODO tracking guides both reflexion and gradient accumulation, reflexions inform TextGrad backward pass, and gradients determine TODO progression.
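The scheduling described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: only the two trigger rules (optimizer merge every 3 steps, reflexion check every 5 steps or on failure) come from the caption. The names `AgentState` and `run_trial`, the string stand-ins for the LLM calls, the boolean list that replaces the environment, and the order in which the two loops fire within a step are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

OPTIMIZE_EVERY = 3  # caption: optimizer synthesizes gradients every 3 steps
REFLECT_EVERY = 5   # caption: reflexion check every 5 steps or on failure


@dataclass
class AgentState:
    """Hypothetical container for the pieces named in the caption."""
    policy: str = "initial policy"
    pending_gradients: list = field(default_factory=list)
    reflexions: list = field(default_factory=list)       # working reflexion buffer
    episodic_memory: list = field(default_factory=list)  # central memory core
    log: list = field(default_factory=list)              # which loop fired, and when


def run_trial(outcomes):
    """Run one trial; `outcomes` is a list of booleans standing in for the
    environment (True means the step's action succeeded)."""
    state = AgentState()
    for step, ok in enumerate(outcomes, start=1):
        # Forward pass (1)-(3): action selection, execution, textual loss.
        # Here the "loss" is just a label derived from the stubbed outcome.
        loss = "ok" if ok else "failed"

        # Backward pass (4)-(5): a textual gradient that sees the current
        # reflexion context is appended to the pending list.
        gradient = f"step {step}: {loss}, reflexions seen: {len(state.reflexions)}"
        state.pending_gradients.append(gradient)

        # Loop 1: merge accumulated gradients into the policy every 3 steps.
        if step % OPTIMIZE_EVERY == 0:
            merged = len(state.pending_gradients)
            state.policy = f"policy merged from {merged} gradients at step {step}"
            state.pending_gradients.clear()
            state.log.append(("optimize", step))

        # Loop 2: reflexion check every 5 steps or whenever a step fails.
        if step % REFLECT_EVERY == 0 or not ok:
            insight = f"insight at step {step}"
            state.reflexions.append(insight)
            state.episodic_memory.append(insight)
            state.log.append(("reflect", step))
    return state
```

Calling `run_trial([True, True, True, False, True, True])` fires the optimizer at steps 3 and 6 and the reflexion check at step 4 (failure) and step 5 (periodic), which shows how the two loops interleave on independent schedules while sharing one state.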

Theorems & Definitions (1)

  • Definition 1: Episodic MDP with Memory