Multimodal Policy Internalization for Conversational Agents
Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya
TL;DR
The paper addresses the challenge of following long, complex multimodal policies in conversational agents by internalizing policy knowledge into model parameters. It introduces Multimodal Policy Internalization (MPI) and TriMPI, a three-stage training framework (Visually-Masked Continual Pretraining, CoT supervised fine-tuning, and PolicyRollout-enabled RL) that embeds policy behavior without requiring the policy at inference. Two new datasets, ClevrPolicy and GTAPolicy, cover synthetic decision-making and real-world tool-usage tasks, enabling comprehensive evaluation across policy complexity, generalization, and robustness to forgetting. TriMPI achieves substantial gains over baselines, generalizes well to policy updates, and reduces dependence on in-context policy prompts, offering practical efficiency and reliability improvements for multimodal agents.
Abstract
Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
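To make the PolicyRollout idea more concrete, the following is a minimal sketch of a GRPO-style group-relative advantage computed over a rollout group that mixes ordinary (policy-free) rollouts with policy-aware ones. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scalar rewards, and the normalization by population standard deviation are all hypothetical choices made here, and the abstract does not specify how policy-aware responses are weighted or sampled.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center each reward on the group mean and
    scale by the group standard deviation (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_rollout_advantages(plain_rewards, policy_aware_rewards, eps=1e-6):
    """Hypothetical sketch: pool rewards from rollouts generated without the
    policy in context and rollouts generated with it, then normalize jointly.
    Returns the advantages split back into (plain, policy_aware)."""
    group = list(plain_rewards) + list(policy_aware_rewards)
    advantages = group_relative_advantages(group, eps)
    n_plain = len(plain_rewards)
    return advantages[:n_plain], advantages[n_plain:]


# Assumed toy rewards: every policy-free rollout fails, so a plain group
# carries no learning signal (all advantages are zero).
plain_only = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Adding two successful policy-aware rollouts to the same group yields
# non-zero advantages for every member.
plain_adv, aware_adv = policy_rollout_advantages([0.0] * 4, [1.0, 1.0])
```

In this toy setting, a group consisting only of failed policy-free rollouts produces zero advantages, whereas pooling in policy-aware responses gives the group a non-degenerate reward spread. This is one plausible reading of how augmenting rollouts could ground exploration; the actual mechanism in the paper may differ.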
