Table of Contents
Fetching ...

Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Lit Sin Tan, Junzhe Chen, Xiaolong Fu, Lichen Ma, Junshi Huang, Jianzhong Shi, Yan Li, Lijie Wen

Abstract

Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.

Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Abstract

Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation primarily rely on search or sampling strategies that produce only instance-level improvements, limiting the ability to learn from prior inferences and accumulate knowledge across similar prompts. To overcome these limitations, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving self-improvement and capability-level improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs, including Janus-Pro-7B, BAGEL, and Qwen-Image, achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis to investigate the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.
Paper Structure (22 sections, 9 equations, 6 figures, 4 tables, 1 algorithm)

This paper contains 22 sections, 9 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Test-Time Reinforcement Learning in UMMs.
  • Figure 2: Meta-TTRL operates with a meta-level introspector that constructs rubric and provides intrinsic monitoring signals, and an object-level generator that produces candidate images. These intrinsic signals are aggregated as rewards to guide policy optimization and improve T2I generation. For clarity, the figure omits the question index $m$ and shows only representative verification questions.
  • Figure 3: Qualitative case studies of Meta-TTRL
  • Figure 4: Performance comparison between Meta-TTRL (blue) and E-TTRL (red) on T2I-CompBench++ across eight subdimensions.
  • Figure 5: Performance comparison of Baseline, RL Leakage with UnifiedReward, and Meta-TTRL across three T2I benchmarks. Left: Radar plot comparing performance on T2I-CompBench++. Right: Bar charts showing results on TIIF-Bench and DPG-Bench.
  • ...and 1 more figures