Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang

TL;DR

This work presents MLES, a framework in which multimodal large language models (MLLMs) synthesize programmatic control policies directly within an evolutionary search loop, using visual behavioral evidence (BE) to guide refinement. Each policy is represented as an executable code block together with a natural-language rationale and qualitative behavioral cues, which keeps the control logic transparent and human-aligned while reaching performance comparable to PPO on two standard control tasks. The authors report improved search efficiency, robustness, and knowledge reuse, and their ablation studies indicate that BE, multimodal modification, and exploration operators are each essential. The paper also discusses practical considerations such as the cost of API-based MLLMs, along with possible extensions to hybrid architectures and human-in-the-loop collaboration, and positions MLES as a promising paradigm for verifiable control policy design.

Abstract

Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.

Paper Structure

This paper contains 58 sections, 2 equations, 23 figures, 11 tables, 1 algorithm.

Figures (23)

  • Figure 1: Overview of methodological differences. (a) Standard DRL: agents learn via reward-guided interaction with environments. (b) MLES (ours): programmatic policies are evolved directly, with behavior analysis integrated into the evolutionary computation (EC)-based policy discovery process.
  • Figure 2: An overview of the MLES framework. The left side illustrates the evolutionary search loop, while the right side details the structure and construction of an evolutionary individual. The red module on the far right exemplifies a method for generating behavioral evidence (BE). During each search step, a subset of parent individuals is selected from a policy pool and used by the prompt sampler to create a multimodal few-shot prompt. The MLLM reasons over this prompt to generate a new offspring policy. The offspring is then evaluated and visualized, producing a new individual that is added to the policy pool, which is then managed accordingly.
  • Figure 3: Convergence on the Lunar Lander task
  • Figure 4: Convergence on the Car Racing task
  • Figure 5: Evolutionary process of Car Racing policies. The plot depicts population score distributions over generations, with yellow lines tracing all ancestors of the best-performing policy. The blue lineage is examined in detail to reveal the stepwise improvements guided by BE-driven insights.
  • ...and 18 more figures
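
The search step described in the Figure 2 caption (select parents, build a few-shot prompt, generate an offspring, evaluate and visualize it, update the pool) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `mock_mllm` is a hypothetical placeholder for a real MLLM call, the 1-D task in `evaluate` is a toy stand-in for Lunar Lander or Car Racing, the behavioral evidence is a text trace rather than an image, and uniform parent sampling and top-k pool management are assumed because the caption does not specify them.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Individual:
    code: str        # executable policy source defining policy(obs) -> action
    rationale: str   # natural-language description of the policy idea
    evidence: str    # stand-in for behavioral evidence (an image in the paper)
    score: float


def evaluate(code: str) -> Tuple[float, str]:
    """Toy task (assumption): drive a 1-D state from 1.0 towards 0 in 20 steps."""
    namespace: dict = {}
    exec(code, namespace)                 # load the candidate policy
    policy = namespace["policy"]
    x, trace = 1.0, []
    for _ in range(20):
        x += policy(x)
        trace.append(round(x, 3))
    # Higher is better; the trace stands in for a rendered trajectory.
    return -abs(x), f"trajectory={trace}"


def build_prompt(parents: List[Individual]) -> str:
    """Few-shot prompt assembled from parent code, rationale, evidence, score."""
    shots = [
        f"# idea: {p.rationale}\n{p.code}# score: {p.score}\n# {p.evidence}"
        for p in parents
    ]
    return "Improve on these policies:\n\n" + "\n\n".join(shots)


def mock_mllm(prompt: str, rng: random.Random) -> Tuple[str, str]:
    """Hypothetical placeholder for an MLLM: proposes a random proportional gain."""
    gain = round(rng.uniform(0.0, 1.0), 2)
    code = f"def policy(obs):\n    return -{gain} * obs\n"
    return code, f"proportional control with gain {gain}"


def search(steps: int, pool_size: int, n_parents: int, seed: int) -> List[Individual]:
    rng = random.Random(seed)
    code = "def policy(obs):\n    return 0.0\n"
    score, evidence = evaluate(code)
    pool = [Individual(code, "do nothing", evidence, score)]
    for _ in range(steps):
        # Assumed selection rule: uniform sampling without replacement.
        parents = rng.sample(pool, k=min(n_parents, len(pool)))
        code, rationale = mock_mllm(build_prompt(parents), rng)
        score, evidence = evaluate(code)
        pool.append(Individual(code, rationale, evidence, score))
        # Assumed pool management: keep the top-scoring individuals.
        pool = sorted(pool, key=lambda ind: ind.score, reverse=True)[:pool_size]
    return pool
```

Calling `search(steps=10, pool_size=4, n_parents=2, seed=0)` returns the surviving individuals sorted by score. In a real setting, the mock generator would be replaced by an MLLM that actually reads the parents' code and rendered behavior, which is where the targeted, evidence-driven improvements described in the paper would come from.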