CodeMem: Architecting Reproducible Agents via Dynamic MCP and Procedural Memory

Nishant Gaurav, Adit Akarsh, Tejas Ravishankar, Manoj Bajaj

TL;DR

CodeMem tackles reproducibility in tool-using agents by replacing probabilistic, prompt-driven logic with a sandboxed Python workflow and a persistent procedural memory bank. By integrating Dynamic MCP-based on-demand tool discovery, a write_todos planning memory, and a mechanism to freeze validated code as reusable skills, CodeMem achieves deterministic execution and scalable tool integration. The paper demonstrates improved reliability, reduced latency, and robustness to context drift, supported by a case study and quantitative experiments across foundational models. This approach enables building reusable agentic workflows that are more reliable for repetitive tasks in production environments.

Abstract

Current tool-using AI agents suffer from a limited action space, context inefficiency, and probabilistic instability, which make them unsuitable for repetitive tasks that agentic workflows built on platforms like n8n and Zapier otherwise handle reliably and efficiently. Earlier works such as CodeAct, DynaSaur, and Code Mode tackle the first two issues by using the whole Python language as the agent's action space: the number of tools the agent can call becomes effectively unlimited, and Python code blocks can combine complex actions into a single step and print only the relevant results, which keeps the context lean. However, the probabilistic instability remains: for the same task in the same environment, the agent can follow different trajectories because of the probabilistic nature of LLMs. Procedural memory is therefore needed for consistency and reliability. This paper proposes CodeMem, an architecture that implements procedural memory via code and can be used to build and run reusable agentic workflows with deterministic reliability.
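
To make the "code as action space" idea from the abstract concrete, the sketch below shows a single code action that chains a tool call with filtering and aggregation and prints only a compact summary, so the raw records never enter the agent's context. The tool `list_orders`, its data, and the helper `summarize_paid_orders` are hypothetical stand-ins chosen for illustration; they do not come from the paper.

```python
# Hypothetical stand-in for a tool the agent can call (not from the paper).
def list_orders():
    return [
        {"id": 1, "customer": "ada", "total": 120.0, "status": "paid"},
        {"id": 2, "customer": "bob", "total": 35.5, "status": "refunded"},
        {"id": 3, "customer": "ada", "total": 64.5, "status": "paid"},
    ]


def summarize_paid_orders(orders):
    """Filter and aggregate inside the code action instead of in the prompt."""
    totals = {}
    for order in orders:
        if order["status"] == "paid":
            customer = order["customer"]
            totals[customer] = totals.get(customer, 0.0) + order["total"]
    return totals


# One code action: fetch, filter, aggregate, and surface only the summary.
paid_totals = summarize_paid_orders(list_orders())
print(paid_totals)
```

With conventional one-call-per-step tool use, the full order list would typically be returned into the context before the model could filter it; here only the one-line summary is printed.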

Paper Structure

This paper contains 35 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Mechanism for updating procedural memory via instruction tuning. The agent analyzes the conversation history and existing instructions to generate refined new instructions (e.g., adding constraints like "Do not use hashtags"). This process, utilized by the LangGraph Tweet generator, relies on meta-prompting rather than code modification [langchain_memory_concepts].
  • Figure 2: The CodeMem Architecture. The Agent creates reproducible workflows by discovering tools (Left), executing them in a sandbox (Center), and freezing successful logic into persistent memory (Right).
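
The three stages named in the Figure 2 caption (discover, execute, freeze) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `TOOL_REGISTRY`, `discover_tools`, and `ProceduralMemory`, the convention that a skill exposes a `main` function, and the use of plain `exec` in place of a real sandbox. It is not the paper's implementation and does not use an actual MCP client.

```python
import tempfile
from pathlib import Path

# Hypothetical tool registry standing in for on-demand tool discovery.
TOOL_REGISTRY = {
    "math.mean": lambda xs: sum(xs) / len(xs),
    "math.total": lambda xs: sum(xs),
}


def discover_tools(query):
    """Return only the tools whose name contains the query string."""
    return {name: fn for name, fn in TOOL_REGISTRY.items() if query in name}


class ProceduralMemory:
    """Stores validated Python source on disk, keyed by skill name."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def freeze(self, name, source):
        compile(source, name, "exec")  # reject source that does not parse
        (self.root / f"{name}.py").write_text(source)

    def run(self, name, **kwargs):
        source = (self.root / f"{name}.py").read_text()
        namespace = {}
        # Assumption: plain exec for brevity; a real system would sandbox this.
        exec(source, namespace)
        return namespace["main"](**kwargs)


# Source the agent is assumed to have written and validated once.
SKILL_SOURCE = '''
def main(tools, numbers):
    return {"count": len(numbers), "mean": tools["math.mean"](numbers)}
'''

memory = ProceduralMemory(tempfile.mkdtemp())
memory.freeze("summarize", SKILL_SOURCE)

tools = discover_tools("math")
first = memory.run("summarize", tools=tools, numbers=[2, 4, 6])
second = memory.run("summarize", tools=tools, numbers=[2, 4, 6])
print(first, first == second)
```

Because the frozen skill is ordinary code rather than a prompt, replaying it with the same inputs follows the same path every time, which is the property the caption attributes to persistent memory.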