Test-Time Alignment for Large Language Models via Textual Model Predictive Control

Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh

TL;DR

This work reframes test-time alignment of large language models as a sequential decision-making problem and introduces Textual Model Predictive Control (TMPC) to balance horizon and dimensionality challenges. TMPC uses Hindsight Subgoal Identification to discover meaningful intermediate targets and Subgoal-Conditioned Re-Generation to iteratively refine outputs by building on proven successes, all without updating model parameters. The approach is instantiated with a predictive planning loop and evaluated on paragraph-level machine translation, long-form response generation, and program synthesis, where it consistently outperforms strong baselines. The results suggest that test-time predictive planning can achieve robust, generalizable alignment across diverse tasks, with practical implications for safer, more controllable LLM deployments.

Abstract

Aligning Large Language Models (LLMs) with human preferences through finetuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory to propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, where TMPC analyzes its generations to retrospectively identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, where these identified subgoals are used to guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC ensures stable improvement by building upon previously validated successes. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.
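
The two principles above can be illustrated with a toy planning loop. This is a minimal sketch under strong assumptions, not the paper's implementation: the "response" is a short list of integers, segments are fixed positions (the paper discovers segment boundaries in hindsight rather than fixing them), and the names `tmpc_toy`, `sample_segment`, `segment_reward`, `alpha`, and `TARGET` are all hypothetical stand-ins for an LLM sampler, a reward model, and a subgoal threshold.

```python
import random
from typing import Callable, Dict, List


def tmpc_toy(
    sample_segment: Callable[[int, random.Random], int],  # hypothetical generator
    segment_reward: Callable[[int, int], float],          # hypothetical reward model
    num_segments: int,
    num_rollouts: int = 4,
    num_iterations: int = 10,
    alpha: float = 1.0,  # assumed threshold for accepting a subgoal
    seed: int = 0,
) -> List[int]:
    """Toy planning loop: no parameters are updated, only a subgoal buffer."""
    rng = random.Random(seed)
    buffer: Dict[int, int] = {}  # segment index -> validated subgoal
    best_plan: List[int] = []
    best_score = float("-inf")

    for _ in range(num_iterations):
        for _ in range(num_rollouts):
            # Subgoal-conditioned re-generation: reuse buffered subgoals and
            # sample only the segments that have no validated subgoal yet.
            rollout = [
                buffer[i] if i in buffer else sample_segment(i, rng)
                for i in range(num_segments)
            ]
            scores = [segment_reward(i, seg) for i, seg in enumerate(rollout)]

            # Hindsight subgoal identification: after the rollout is scored,
            # keep every segment whose reward meets the threshold.
            for i, (seg, score) in enumerate(zip(rollout, scores)):
                if score >= alpha:
                    buffer[i] = seg

            # Track the best full plan seen so far.
            total = sum(scores)
            if total > best_score:
                best_plan, best_score = rollout, total

    return best_plan


# Toy task: recover a hidden target sequence from per-segment feedback.
TARGET = [3, 1, 4, 1, 5]


def sample_segment(i: int, rng: random.Random) -> int:
    return rng.randint(0, 9)


def segment_reward(i: int, seg: int) -> float:
    return 1.0 if seg == TARGET[i] else 0.0


plan = tmpc_toy(sample_segment, segment_reward,
                num_segments=len(TARGET), num_iterations=50)
```

Because each rollout reuses every buffered subgoal, the total reward of successive rollouts in this toy never drops below what the buffer already guarantees, which mirrors the abstract's claim of building on previously validated successes. Resampling whole responses without a buffer would instead have to match all segments at once, which is the response-level dimensionality problem in miniature.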

Paper Structure

This paper contains 46 sections, 5 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Textual Model Predictive Control (TMPC) balances the curse of horizon in guided decoding against the curse of dimensionality in naive iterative refinement. It employs Hindsight Subgoal Identification to dynamically discover promising states from rollouts and Subgoal-Conditioned Re-Generation to guide the search from these discovered subgoals, ensuring a stable alignment.
  • Figure 2: TMPC adapts the MPPI framework for test-time alignment by introducing two core principles. Hindsight Subgoal Identification: After generating multiple rollouts, the planner's aggregation function $\mathcal{G}$ selects a subset of locally-optimal actions $\widetilde{\boldsymbol{a}}^{\text{TMPC}}$. This executed plan is retrospectively identified as a high-quality subgoal and stored in a buffer $\mathcal{B}$ if its utility meets a threshold $\alpha$. Subgoal-Conditioned Re-Generation: New rollouts are generated by sampling from and composing subgoals in the buffer $\mathcal{B}$. This allows the planner to iteratively refine the full-horizon plan by building upon the best strategies discovered in previous iterations.
  • Figure 3:
  • Figure 4: The pass rates on MBPP.
  • Figure 5: Robustness and sensitivity analysis of TMPC. (a) Robustness to hyperparameter choices, with performance varying by less than 0.1 points across different buffer and segment sizes. (b) Robustness to imperfections in the reward signal, including both injected noise and lower accuracy. (c) SEGALEcomet scores across iterations on zh→en translation. The standard TMPC steadily improves with more iterations, while a degraded version mimicking naive iterative refinement stagnates.
  • ...and 7 more figures