Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models

Lit Sin Tan; Junzhe Chen; Xiaolong Fu; Lichen Ma; Junshi Huang; Jianzhong Shi; Yan Li; Lijie Wen

Abstract

Existing test-time scaling (TTS) methods for unified multimodal models (UMMs) in text-to-image (T2I) generation rely primarily on search or sampling strategies. These yield only instance-level improvements: the model cannot learn from prior inferences or accumulate knowledge across similar prompts. To overcome this limitation, we propose Meta-TTRL, a metacognitive test-time reinforcement learning framework. Meta-TTRL performs test-time parameter optimization guided by model-intrinsic monitoring signals derived from the meta-knowledge of UMMs, achieving capability-level self-improvement at test time. Extensive experiments demonstrate that Meta-TTRL generalizes well across three representative UMMs (Janus-Pro-7B, BAGEL, and Qwen-Image), achieving significant gains on compositional reasoning tasks and multiple T2I benchmarks with limited data. We provide the first comprehensive analysis of the potential of test-time reinforcement learning (TTRL) for T2I generation in UMMs. Our analysis further reveals a key insight underlying effective TTRL: metacognitive synergy, where monitoring signals align with the model's optimization regime to enable self-improvement.

Paper Structure (22 sections, 9 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Unified Multimodal Models (UMMs).
Test-Time Reinforcement Learning (TTRL).
Methodology
Two-level Metacognitive Architecture
Monitoring–Control Loop
Step 1: Meta-level Rubric Construction
Step 2: Object-level Text-to-Image Generation
Step 3: Meta-level Intrinsic Monitoring
Step 4: Meta-to-Object Policy Control via Intrinsic Signal
Experiments
Experimental Setup
Main Results
Analysis
...and 7 more sections

Figures (6)

Figure 1: Test-Time Reinforcement Learning in UMMs.
Figure 2: Meta-TTRL operates with a meta-level introspector, which constructs a rubric and provides intrinsic monitoring signals, and an object-level generator, which produces candidate images. These intrinsic signals are aggregated as rewards to guide policy optimization and improve T2I generation. For clarity, the figure omits the question index $m$ and shows only representative verification questions.
Figure 3: Qualitative case studies of Meta-TTRL
Figure 4: Performance comparison between Meta-TTRL (blue) and E-TTRL (red) on T2I-CompBench++ across eight subdimensions.
Figure 5: Performance comparison of Baseline, RL Leakage with UnifiedReward, and Meta-TTRL across three T2I benchmarks. Left: Radar plot comparing performance on T2I-CompBench++. Right: Bar charts showing results on TIIF-Bench and DPG-Bench.
...and 1 more figure
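
The caption of Figure 2 says that intrinsic monitoring signals are aggregated as rewards that guide policy optimization. The sketch below illustrates one way such an aggregation could look; it is not the authors' implementation. It assumes that each candidate image receives one yes/no answer per verification question, that the reward is the fraction of questions passed, and that rewards are normalized within the group of candidates before the policy update. The mean aggregation, the group-relative normalization, and all function and variable names are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def aggregate_reward(answers):
    """Turn per-question monitoring answers into one scalar reward.

    `answers` holds one boolean per verification question (True = the
    introspector judged the image to satisfy that question). Using the
    plain fraction of passed questions is an assumption of this sketch.
    """
    if not answers:
        raise ValueError("at least one verification question is required")
    return sum(answers) / len(answers)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidate images.

    Subtracting the group mean and dividing by the group standard
    deviation is a common choice in group-based policy optimization;
    whether Meta-TTRL uses it is an assumption, not a source claim.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical monitoring output: 4 candidate images for one prompt,
# each checked against 3 verification questions from the rubric.
monitoring = [
    [True, True, True],
    [True, False, True],
    [False, False, True],
    [True, True, False],
]

rewards = [aggregate_reward(answers) for answers in monitoring]
advantages = group_relative_advantages(rewards)

for idx, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"candidate {idx}: reward={r:.3f} advantage={a:+.3f}")
```

In this toy group the candidate that passes every question receives the largest advantage and the one that passes the fewest receives the smallest, so a policy update weighted by these values would push the generator toward images that the model's own introspection rates as more faithful to the prompt.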
