Multimodal Policy Internalization for Conversational Agents

Zhenhailong Wang, Jiateng Liu, Amin Fazel, Ritesh Sarkhel, Xing Fan, Xiang Li, Chenlei Guo, Heng Ji, Ruhi Sarikaya

TL;DR

The paper addresses the difficulty of following long, complex multimodal policies in conversational agents by internalizing policy knowledge into model parameters. It introduces the Multimodal Policy Internalization (MPI) task and TriMPI, a three-stage training framework (Visually-Masked Continual Pretraining, chain-of-thought (CoT) supervised fine-tuning, and PolicyRollout-enabled RL) that embeds policy behavior so the policy is not needed at inference. Two new datasets, ClevrPolicy and GTAPolicy, cover synthetic decision-making and real-world tool-usage tasks, supporting evaluation across policy complexity, generalization, and robustness to forgetting. TriMPI achieves substantial gains over baselines, generalizes well to policy updates, and reduces dependence on in-context policy prompts, which improves both efficiency and reliability for multimodal agents.

Abstract

Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.

Paper Structure

This paper contains 49 sections, 6 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Motivation of the proposed Multimodal Policy Internalization task. The goal is to enhance the policy-following abilities of a large multimodal model without requiring the policy to be provided in-context during inference, thereby improving both performance and efficiency.
  • Figure 2: ClevrPolicy dataset. Left: Illustration of policy generation, where a decision tree is first generated and converted into natural language instructions (see Appendix \ref{subapp:data_clevr_policy} for details on the decision node ontology, and Figures \ref{fig:policy_example_clevrpolicy_t}, \ref{fig:policy_example_clevrpolicy_m} for full policy examples). Right: Example input-output pair corresponding to the policy. The policy is available only during training and not during inference.
  • Figure 3: GTAPolicy dataset. Left: Illustration of the policy, consisting of two major parts, tool descriptions and tool-calling rules (see Figure \ref{fig:policy_example_gtapolicy} for the full policy). Right: Example input-output pair corresponding to the policy. The visual input can contain multiple images.
  • Figure 4: Overview of different training algorithms for multimodal policy internalization. The solid purple outlines indicate the parts where the next-token prediction loss is computed. On the right, we illustrate the proposed three-stage training strategy, TriMPI, which enables direct policy knowledge injection through the VM-CPT stage and policy-grounded reinforcement learning through PolicyRollout. The PolicyRollout algorithm is detailed in §\ref{subsec:policyrollout} and illustrated in Figure 5.
  • Figure 5: Illustration of the PolicyRollout algorithm (applied to GRPO as an example). During the rollout phase, we additionally construct a set of input instances with the policy included in-context. These policy-aware responses are added to the rollout space as if they were generated from the original inputs without the policy in-context. The advantage and policy gradient are then computed on the combined rollouts, indicated by the thick red outlines. PolicyRollout enables more policy-aware exploration without introducing a gap between training and inference, leading to significant improvements in MPI, especially on complex policies.
  • ...and 11 more figures
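
The Figure 5 caption describes computing the advantage over the combined set of ordinary and policy-aware rollouts. The following is a minimal sketch of that one step only, not the authors' implementation. It assumes scalar rewards per response and the commonly used GRPO group normalization (reward minus group mean, divided by group standard deviation); the function names, the `eps` term, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumption: population std and an additive eps; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_rollout_advantages(plain_rewards, policy_aware_rewards, eps=1e-6):
    """Normalize over the combined rollout group, as the Figure 5 caption describes.

    plain_rewards: rewards of responses sampled WITHOUT the policy in context.
    policy_aware_rewards: rewards of responses sampled WITH the policy in context,
        then treated as if they came from the policy-free input.
    Returns the advantages split back into the two groups.
    """
    combined = list(plain_rewards) + list(policy_aware_rewards)
    advantages = grpo_advantages(combined, eps)
    n_plain = len(plain_rewards)
    return advantages[:n_plain], advantages[n_plain:]


# Hypothetical binary rewards: every policy-free rollout fails,
# while most policy-aware rollouts succeed.
plain = [0.0, 0.0, 0.0, 0.0]
aware = [1.0, 1.0, 0.0, 1.0]

plain_only = grpo_advantages(plain)
plain_adv, aware_adv = policy_rollout_advantages(plain, aware)
```

In this hypothetical case, normalizing over the policy-free rollouts alone yields all-zero advantages and therefore no gradient signal, whereas the combined group assigns positive advantages to the successful policy-aware responses. That is one way to read the caption's claim that PolicyRollout enables more policy-aware exploration; the sketch does not cover how log-probabilities or the policy-gradient loss are computed.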