Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang
TL;DR
This work presents MLES, a framework that uses multimodal large language models (MLLMs) to synthesize programmatic control policies directly within an evolutionary search loop, with visual behavioral evidence (BE) guiding refinement. Each policy is represented as an executable code block together with a natural-language rationale and qualitative behavioral cues, so MLES provides transparent, human-aligned control logic while achieving performance on par with PPO on two standard control tasks. The approach shows improved search efficiency, robustness, and knowledge reuse, and the ablation studies underscore the essential roles of BE, multimodal modification, and exploration operators. The paper also discusses practical considerations such as the cost of API-based MLLMs and possible extensions to hybrid architectures and human-in-the-loop collaboration, pointing to a promising paradigm for verifiable control policy design.
Abstract
Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.
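To make the search loop described above more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. Everything in it is hypothetical: the `Candidate` record, the toy 1-D task in `evaluate`, the PD-style `TEMPLATE`, and above all `stub_generator`, which stands in for the MLLM call by perturbing two gains. The recorded trajectory is only a numeric placeholder for the visual behavioral evidence an MLLM would inspect.

```python
import random
from dataclasses import dataclass, field

# Hypothetical policy template; a real MLLM would write free-form code instead.
TEMPLATE = "def policy(x, v):\n    return -{kp:.3f} * x - {kd:.3f} * v\n"

@dataclass
class Candidate:
    kp: float
    kd: float
    rationale: str                      # natural-language design note
    code: str = ""                      # executable policy source
    score: float = float("-inf")
    evidence: list = field(default_factory=list)  # placeholder for visual evidence

    def __post_init__(self):
        self.code = TEMPLATE.format(kp=self.kp, kd=self.kd)

def evaluate(cand, steps=50):
    """Toy 1-D task (an assumption): drive a point from x=5 toward the origin."""
    namespace = {}
    exec(cand.code, namespace)          # compile the programmatic policy
    policy = namespace["policy"]
    x, v, total, trajectory = 5.0, 0.0, 0.0, []
    for _ in range(steps):
        a = max(-1.0, min(1.0, policy(x, v)))  # clip the action
        v += 0.1 * a
        x += 0.1 * v
        total -= abs(x)                 # reward: stay close to the origin
        trajectory.append(x)
    cand.score, cand.evidence = total, trajectory
    return cand

def stub_generator(parent, rng):
    """Stand-in for the MLLM: reads the parent's behavior and proposes a child."""
    overshoot = min(parent.evidence) < -0.5
    kd = parent.kd + (0.3 if overshoot else 0.0) + rng.gauss(0, 0.1)
    kp = parent.kp + rng.gauss(0, 0.2)
    note = "add damping: trajectory overshoots" if overshoot else "perturb gains"
    return Candidate(max(0.0, kp), max(0.0, kd), note)

def evolve(generations=20, pop_size=6, seed=0):
    rng = random.Random(seed)
    pop = [evaluate(Candidate(rng.uniform(0, 1), rng.uniform(0, 1), "random init"))
           for _ in range(pop_size)]
    pop.sort(key=lambda c: c.score, reverse=True)
    history = [pop[0].score]
    for _ in range(generations):
        parent = max(rng.sample(pop, 2), key=lambda c: c.score)  # tournament pick
        pop.append(evaluate(stub_generator(parent, rng)))
        pop.sort(key=lambda c: c.score, reverse=True)
        pop = pop[:pop_size]            # elitist truncation keeps the best
        history.append(pop[0].score)
    return pop[0], history

if __name__ == "__main__":
    best, history = evolve()
    print(best.rationale)
    print(best.code)
```

The point of the sketch is the shape of the loop: each individual carries code, a rationale, and behavior evidence, and the generator conditions on that evidence when proposing a child. In the approach described above, the generator would be a multimodal model prompted with the parent's code and rendered behavior rather than a numeric perturbation.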
